INTRODUCTION
With the advent of high-throughput technologies, genomic science has experienced great leaps, rapidly expanding its domain beyond the characterization of short genomic reads in the early days of sequencing to the possibility of obtaining personalized genomes, once considered the holy grail of genomic methodology and technology development. The value of personalized genomic analysis and of evaluating variant associations with disease is becoming more apparent, even spurring direct-to-consumer implementations. Further developments in the last few years now lead to a more ambitious goal: the longitudinal monitoring of multiple omics components in individuals and the characterization, at an unprecedented level, of the molecular changes associated with disease onset. In this review we describe technological and methodological developments in personal genomics and the new promise of multiple omics profiling, including transcriptomes, proteomes, metabolomes, autoantibodyomes and so forth (sample omics analysis workflows are shown in Figures 1-4). We then discuss a framework for how such data may be integrated with a view towards the application of personalized, precise and preventive medicine, and describe an implementation of this approach. The technological and methodological developments allow for inroads into the future of quantitative personal medicine, which we can now plan carefully by taking into account not only the scientific developments that need to be implemented, but also the social implications coupled to ethical and legal considerations.
GENOMIC SEQUENCING
In 2001 the completion of the Human Genome Project (HGP) was effectively announced with the publication of the first complete human genome sequence. The HGP came at a hefty $2.7 billion cost using the best technology of the time, making it seem prohibitive to expect personal genome sequences shortly thereafter. Yet immense technological advancement, spurred by the motivation of the National Institutes of Health (NIH) and the National Human Genome Research Institute (NHGRI) to bring down genomic costs, led to unprecedented growth in technology and methodology, enabling the drop in sequencing costs (http://www.genome.gov/sequencingcosts) to continue at a rate beyond the most optimistic projections of 2001 (<$4000 currently). While initially the human genome was a combination of multiple individual genomic data [1-3], developments by 2008 had allowed the determination of an individual's genomic makeup [4-7]. Personalized Whole Genome Sequencing (WGS) is now possible, and the dwindling sequencing costs promise affordability for all in the near future [8]. These developments encouraged efforts to characterize disease at the genomic level and to apply an all-encompassing genomic medicine at the molecular level. The initial goals were the characterization of populations for large studies, now shifting to the individual.
Multiple technology developments and dropping costs
The HGP relied on Sanger-based capillary sequencing technology [1], with an estimated production of 115 thousand base pairs per day (kbp/day) [9]. The NHGRI spurred development through the $1000 genome program (http://www.genome.gov/11008124), leading to the industry development of multiple massively parallel [10] sequencing platforms (e.g., Roche/454, based on pyrosequencing [11-13]; Life Technologies SOLiD [14-16]; Illumina [5,6]; Complete Genomics, based on DNA nanoball sequencing [17]; Helicos Biosciences [18]; and recently single molecule real-time technology [19,20] by Pacific Biosciences). These next generation sequencing platforms are now being supplemented by what has been termed third-generation sequencing [21], including nanopore technologies such as those announced in early 2010 by Oxford Nanopore Technologies [22]. The technological developments and competition have resulted in a drastic and continuing drop in sequencing costs and processing times, and exponential increases in the number of reads produced.
An alternative to sequencing the whole genome has been whole exome sequencing (WES) [23]. This technology aims to study the exonic regions of the genome (~2%-3%), which are associated with several Mendelian disorders. It offers a lower cost option (e.g., Illumina, Agilent and NimbleGen platforms; see Clark et al. for a comparison of the latter two [24]) and has received immense attention, including the Exome Sequencing Project (ESP) (see the Exome Variant Server at http://evs.gs.washington.edu/EVS/), supported by the National Heart, Lung and Blood Institute (NHLBI).
Quantitating genomic variation
Concurrently with the technological developments, our understanding of the human genome has grown immensely since the publication of the reference genome in 2003. The aim has been to determine the precise role of each base in the genome and to identify genomic variants (Figure 1). Several collaborative large-scale efforts have pursued such investigations. The International HapMap Consortium [25,26] sought to identify common population variants and led to the development of public databases, such as dbSNP [27] (http://www.ncbi.nlm.nih.gov/SNP/), which catalogues Single Nucleotide Polymorphisms (SNPs) (defined as occurring in >1% of the population, to differentiate them from Single Nucleotide Variants (SNVs)). This has revealed great genomic variation both in global populations [28,29] and in populations of admixed ancestry [30-33].
Typically the technologies involve the assignment of reads to the reference genome to determine the structure of the underlying sequence, including variation (Figure 1; a toy illustration of this idea is sketched below). Beyond nucleotide variation, other genomic differences have been investigated, including small insertions and deletions (indels), copy number variations (CNVs) indicating varying numbers of genomic segments, and longer chromosomal segments that contribute to Structural Variation (SVs), defined for segments of chromosomes larger than 1000 bp (Figure 1A). Such efforts have been based on microarray methodology [34-37]; higher resolution of structural variants may be achieved with other methods [38-41]. Structural variants have been made publicly available in the database of Genomic Structural Variation (dbVAR; http://www.ncbi.nlm.nih.gov/dbvar/).
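To make the read-to-variant logic above concrete, here is a minimal Python sketch of pileup-based SNV calling from toy pre-aligned reads. The reference, reads and thresholds are invented for illustration; real pipelines use dedicated aligners and callers (e.g., BWA, GATK) and model base quality and sequencing error, which this sketch does not.

```python
# Minimal sketch of pileup-based SNV calling (illustrative data and thresholds).
from collections import Counter

reference = "ACGTACGTACGT"
# (0-based alignment start on the reference, read sequence)
aligned_reads = [(0, "ACGTACGAACGT"), (2, "GTACGAACG"), (4, "ACGAACG")]

def call_snvs(reference, reads, min_depth=2, min_frac=0.8):
    """Report positions where most covering reads disagree with the reference."""
    pileup = [Counter() for _ in reference]
    for start, seq in reads:
        for offset, base in enumerate(seq):
            if start + offset < len(reference):
                pileup[start + offset][base] += 1
    snvs = []
    for pos, counts in enumerate(pileup):
        depth = sum(counts.values())
        if depth < min_depth:
            continue
        alt, alt_count = counts.most_common(1)[0]
        if alt != reference[pos] and alt_count / depth >= min_frac:
            snvs.append((pos, reference[pos], alt, depth))
    return snvs

print(call_snvs(reference, aligned_reads))  # -> [(7, 'T', 'A', 3)]
```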
Furthermore, functional elements have been extensively catalogued by the Encyclopedia of DNA Elements consortium (ENCODE; http://genome.gov/encode; ~10 production projects), with funding from the NHGRI. ENCODE data, including regulatory elements and RNA- and protein-level elements, have now been released and the project has received widespread attention [42-45]. The ENCODE project aims at a biochemical genomic characterization, with a thorough mapping of transcribed regions, transcription factor binding sites, open chromatin signatures, chromatin modifications and DNA methylation. Such extensive data still need to be annotated [46] and interpreted in terms of biological significance, mechanisms and connections to phenotype, and will likely prove invaluable in our interpretation of personalized genomic differences.
Though initially limited by the number of complete genomic sequences, such data are now continuously updated and expanded by information from other projects, such as the 1000 Genomes Project [47] discussed below, which has allowed us a better view of the great variability in each individual genome (~3-4 × 10⁶ SNPs, >200,000 SVs of varying sizes, ~1500 SVs >2 kbp), with much of the variation considered rare (1%-5%). Genome-Wide Association Studies (GWAS) try to associate common variants with disease by combining the now readily available extensive variant information and allelic variability with linkage disequilibrium (a description of the correlation patterns between proximal variants; a toy calculation is sketched below). The NHGRI provides a publicly available catalogue of published GWAS (http://www.genome.gov/gwastudies) [48]. The early expectation of finding common traits and genomic features unique to diseases has proven more complicated to fulfill, as genomic variability turns out to be higher than expected and the genetic variants need further validation.
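As an illustration of linkage disequilibrium, the following sketch computes the standard D and r² statistics between two biallelic SNPs from a handful of made-up phased haplotypes; GWAS analyses rely on the same quantities at scale.

```python
# Illustrative computation of linkage disequilibrium (r^2) between two
# biallelic SNPs from phased haplotypes; the data below are fabricated.
import numpy as np

# rows = haplotypes, columns = two SNP sites (0 = reference allele, 1 = alternate)
haplotypes = np.array([[0, 0], [0, 0], [1, 1], [1, 1], [0, 0], [1, 0], [0, 1], [1, 1]])

p_a = haplotypes[:, 0].mean()                 # alternate-allele frequency at SNP 1
p_b = haplotypes[:, 1].mean()                 # alternate-allele frequency at SNP 2
p_ab = (haplotypes.sum(axis=1) == 2).mean()   # both alternates on the same haplotype
D = p_ab - p_a * p_b                          # disequilibrium coefficient
r2 = D**2 / (p_a * (1 - p_a) * p_b * (1 - p_b))
print(f"D = {D:.3f}, r^2 = {r2:.3f}")         # -> D = 0.125, r^2 = 0.250
```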
WGS and WES have been used successfully to identify somatic mutations, and Mendelian disorders, including neurological disorders, as well as cancer have been characterized using WES [49-58], including some recent single-cell studies [59,60]. Genomics may help classify cancer subtypes and suggest possible treatments, and such research is at the center of WGS efforts, with projects such as the Cancer Genome Atlas [61] (http://cancergenome.nih.gov/) and the International Cancer Genome Consortium (http://www.icgc.org). Additionally, cancer-specific public databases are already available [62], including a cancer cell line encyclopedia [63], and genome characterization has been carried out, for example, in ovarian cancer [61], melanoma [64], lymphocytic leukemia [65], breast cancer [66-69] and acute myeloid leukemia (AML) [70,71].
Personalized risk evaluation
One of the goals of personalized genome interpretation is the evaluation of disease risk factors based on an individual's variant and allelic composition. Such information may be compared to that of similar individuals with known disease associations to assess whether an individual shows increased or decreased risk compared to the control group. A combination of known SNPs and personalized variants has been found to be effective [72-75] and has been used in clinical studies; more recently, a seminal study by Ashley et al. [76] evaluated disease risk for a patient with a family history of vascular disease.
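The sketch below shows, in minimal form, how per-variant evidence can be integrated into a single risk estimate, naively assuming independent variants that each contribute a likelihood ratio; the prevalence and LR values are invented, and this is not the actual RiskOGram [76] implementation.

```python
# Hedged sketch of combining per-variant likelihood ratios into a disease
# risk estimate. Independence between variants is assumed; values are invented.
pretest_prob = 0.08                        # population prevalence of the disease
likelihood_ratios = [1.3, 0.9, 1.7, 1.1]   # one LR per genotyped risk variant

odds = pretest_prob / (1 - pretest_prob)
for lr in likelihood_ratios:
    odds *= lr                             # naive Bayes style update per variant
posttest_prob = odds / (1 + odds)
print(f"pre-test risk {pretest_prob:.1%} -> post-test risk {posttest_prob:.1%}")
# -> pre-test risk 8.0% -> post-test risk ~16%
```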
Personalized evaluation of potential drug responses can be based on the effects of variants [77,78], including drug selection, sensitivity and dosage estimation, e.g., for cardiovascular drugs [79] and schizophrenia-related medications [80]. For example, PharmGKB (http://www.pharmgkb.org) provides a curated database of pharmacogenomic information [81,82], exploring the impact of genomic variation on drug responses as these relate to expressed genes and associated pathways and disorders. Future applications are to include precise drug dosing for an individual, avoiding trial-and-error methods and providing more effective treatment.
The evaluation of personalized risk based on genomes is now appearing in direct-to-consumer services. Companies like 23andMe and deCODEme (and previously Navigenics) offer to assess individual genotypes and provide disease interpretation services based on Mendelian disorder evaluation, including pharmacogenomic responses. These are mostly based on SNP evaluation, and the tests, though limited in scope, do offer interpretations attractive to many consumers.
Personal genome projects
Presently thousands of genomes have been completely sequenced. One of the first large scale projects has been the 1000 Genomes Project [47], which has made its data publicly available and has encouraged the development of streamlined bioinformatics tools to analyze the variation in individual genomes (Figure 1). This project aims to combine data from 2500 individuals from multiple populations, at 4× coverage.
Another grand scale effort, driven by George Church's group at Harvard University, is the Personal Genome Project (PGP) [83-85]. The project has been recruiting individuals who can share their medical and other information together with genomic information online (http://www.personalgenomes.org). The volunteers share full DNA sequences, RNA and protein profile information in addition to extensive phenotype information, including medical records and environmental considerations, with all the data made publicly available, and with plans to expand to 1,000,000 individuals [86]. One of the rather unique features of the PGP is that it differs in the consent of participants as compared to traditional studies. The ownership of the data is to be open and publicly available without restrictions, not only for the initial scope of the study but also for follow-up or additional investigations. The scope is participatory, with the volunteers for the project interacting directly with the researchers. To address informed consent, participants pass a basic genetic literacy exam and must understand the project's scope. Additionally, they provide a complete medical history, including immunizations and medications, which becomes part of the publicly available subject information. Access to an individual's data in the project can be either private to the participant and researchers only or completely public, depending on the participant's choice. The availability of extensive patient and omic information will be invaluable to researchers in developing robust analysis models for characterizing genomes and disease, and the PGP, with its publicly open structure model, will be at the forefront of such efforts.
BEYOND THE GENOME: OTHER OMICS
Transcriptomics
Though the genetic code in DNA is almost identical across cells (aside from somatic variation), different cells show different gene expression, corresponding to the kind of cell, its developmental stage and its physiological state. The collection of transcripts in a cell (e.g., mRNA, non-coding RNA and small RNAs), the transcriptome, is essential to our understanding of cell function and response to disease. Considerations must include start and end sites of genes, coding regions, alternative splicing and post-transcriptional modifications.
Initial inroads were made using high-density oligo microarrays and in-house custom-made microarrays [87], with high-density arrays having resolutions up to 100 bp [88-91]. While relatively inexpensive, these methods suffered from relying on prior knowledge of the genome and faced technical issues such as background and saturation effects [92]. Hybridization interactions between probe sets in short oligo microarrays can also lead to spurious correlations [92,93].
The development of RNA sequencing (RNA-Seq) brought higher coverage, better precision and quantitation, and higher resolution and sensitivity, bringing RNA-Seq technology and transcriptomics on par with genomic sequencing [94-98]. RNA-Seq considers reads corresponding to millions of transcriptomic fragments that are mapped to the reference genome, providing information on transcripts that may not be in the existing genomic annotation, allowing the search for novel transcripts and even the identification of SNPs and other variants, while showing remarkable reproducibility (Figure 2). Transcriptome profiling has included cancers [99-101], including breast cancer [102], gastrointestinal tumors [103] and prostate cancer [104].
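As a toy example of RNA-Seq quantitation downstream of read mapping, the sketch below converts mapped read counts into transcripts per million (TPM), one common length-normalized abundance measure; the gene names, counts and lengths are fabricated.

```python
# Minimal sketch of RNA-Seq abundance estimation from mapped read counts,
# computing TPM (transcripts per million); counts and lengths are invented.
read_counts = {"GENE_A": 1500, "GENE_B": 300, "GENE_C": 4500}
gene_lengths_kb = {"GENE_A": 2.0, "GENE_B": 0.5, "GENE_C": 10.0}

# length-normalize first, then scale so the values sum to one million
rpk = {g: read_counts[g] / gene_lengths_kb[g] for g in read_counts}
scale = sum(rpk.values()) / 1e6
tpm = {g: v / scale for g, v in rpk.items()}
for gene, value in tpm.items():
    print(f"{gene}: {value:,.0f} TPM")
# GENE_C has the most reads but, being long, the lowest TPM of the three
```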
Mass spectrometry, proteomics and metabolomics
Gene expression was expected to correlate with protein levels in a cell, and it was thought that methods such as RNA-Seq would suffice to ascertain the proteomic expression corresponding to gene expression. Proteins are expected to be closer to phenotype, as they participate in every aspect of cellular biology, but their expression levels are difficult to quantitate, partly because of translational control in cells, possible degradation and sampling issues [105-107]. The development of electrospray ionization brought mass spectrometry (MS) to the field of proteomics and the possible identification of thousands of molecules based on mass [108-112]. This has enabled not only the cataloguing of proteins, but also the querying of post-translational modifications [113,114]. As the techniques matured, liquid chromatography tandem mass spectrometry (LC-MS/MS) has become standard, and novel instruments (e.g., the Velos family [115] by Thermo Scientific; quadrupole time-of-flight mass spectrometers (QTOFs) by Agilent) allow unprecedented precision, enabling the development of methods to identify thousands of proteins (~4000-6000 over 2 days) and to quantitate protein levels [73,116] (Figure 3). One set of methods uses stable isotopic labeling by amino acids in cell culture (SILAC) to label cells with light and heavy isotopes of amino acids, providing paired spectral peaks in MS for identification and quantitation [117-120]; this method is now supplemented by 'spike-in'/'super' SILAC, which has been used to measure biopsy tumor proteomes [121]. Another possibility is to use isobaric tags for relative and absolute quantitation (iTRAQ) [122,123] or tandem mass tag (TMT) labeling [73,124,125], among other methods, including spiking in peptides for absolute quantitation. Finally, it is possible to employ label-free methods for quantitation, which do not rely on tags, including signal integration methods and MS spectral counting [126-131].
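To illustrate the quantitation step, the sketch below computes a SILAC-style heavy/light protein ratio as the median log2 ratio over its peptides; the peptide sequences and intensities are fabricated, and real workflows (e.g., in tools such as MaxQuant) add normalization and statistical testing on top of this basic idea.

```python
# Toy illustration of SILAC-style quantitation: the ratio of heavy to light
# isotope peak intensities per peptide estimates relative protein abundance
# between two conditions. All intensities below are fabricated.
import math

# (peptide, light intensity, heavy intensity) for one protein
peptide_pairs = [("LVNELTEFAK", 2.1e6, 4.0e6),
                 ("YLYEIAR",    1.2e6, 2.6e6),
                 ("AEFVEVTK",   3.3e6, 6.1e6)]

# summarize with the median log2 ratio, robust to outlier peptides
log_ratios = sorted(math.log2(h / l) for _, l, h in peptide_pairs)
median_log2 = log_ratios[len(log_ratios) // 2]
print(f"protein heavy/light ratio ~ {2**median_log2:.2f} "
      f"({median_log2:+.2f} log2)")   # ~1.9-fold higher in the heavy condition
```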
In comparison to whole transcriptome profiling, the numbers of proteins identified in proteome profiling tend to be lower, particularly since low peptide levels cannot be amplified (cf. polymerase chain reaction methods for sequencing). Additionally, the current bottom-up (shotgun) proteomics methodology uses digestion with endopeptidases such as trypsin to obtain peptides of small enough mass to be identified by MS/MS, resulting in many fragments that cannot be identified in MS; this may possibly be alleviated by top-down approaches that do not employ a digestion step [132-136]. However, proteomics provides insights that are missing from transcriptomic analysis, especially given the low correlations between protein and transcript differential gene expression [73,137-142].
Multiple proteomes have been quantitatively profiled, including a characterization of ovarian cancer [143], an integrated approach combining transcriptome and proteome information in a human cancer cell line by Nagaraj et al. [144], an integrative gastric cancer characterization including effects of post-translational modifications [145], and searches for biomarkers in other cancers [146,147].
In addition to developments in proteomics, MS has encouraged the study of small molecules. The behavior of small molecules in cells, though difficult to track, provides insight into many common disorders. The set of all cellular small molecules is collectively called the metabolome. Metabolic processes are vital in biological pathways, and a systems analysis of molecular cell complexity might lead to biomarker discovery, and possibly disease risk assessment, diagnosis and treatment [148]. Similarly to proteomics, metabolomics can employ mass spectrometry to identify compounds [149] (Figure 4), and cataloguing is under way, with thousands of metabolites identified by structure, mass and occasionally associated biological processes [150-161]. The identification of compounds can be based on MS/MS application and the use of known compound spectra, or via standards against which mass spectra are compared. The profiling of metabolic components on an individualized basis can provide insights into pharmacogenomics and personalized medications, in addition to potential biomarkers, for example cholesterol levels and coronary artery disease [162,163]. The metabolomics of cancer has been extensively studied [164-166], Type 2 Diabetes has been investigated [167], and in vivo interactions with proteins are being evaluated [168].
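A simple sketch of accurate-mass metabolite annotation follows: an observed m/z is matched against a small reference table within a ppm tolerance. The [M+H]+ masses are standard monoisotopic values, while the tolerance and matching scheme are illustrative; in practice retention times and MS/MS spectra are used to confirm identities, as noted above.

```python
# Sketch of metabolite identification by accurate mass: match an observed m/z
# against a reference table within a ppm tolerance. Reference masses are the
# [M+H]+ ions of a few common metabolites; the tolerance is illustrative.
reference_mz = {"glucose": 181.0707, "citrate": 193.0343, "cholesterol": 387.3622}

def match_metabolite(observed_mz, tolerance_ppm=10.0):
    hits = []
    for name, mz in reference_mz.items():
        ppm_error = abs(observed_mz - mz) / mz * 1e6
        if ppm_error <= tolerance_ppm:
            hits.append((name, ppm_error))
    return hits

print(match_metabolite(181.0712))  # -> [('glucose', ~2.8 ppm)]
```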
Other omics
Genomes, transcriptomes and metabolomes have received widespread attention and currently offer the most quantitative data, provided by robust and comprehensive omics technologies, in terms of both experimental and computational methodology. However, multiple other omics are available, and their numbers are increasing, with a few notable technologies mentioned below:
• Autoantibodyomes: In addition to the direct profiling of proteins, the reactivity of proteins to autoantibodies may be profiled on a large scale. Spotted protein arrays [169-173] have been implemented to study, for example, effects in cancer [174], immune response [175] and, recently, diabetes [176]. Another approach is the Nucleic Acid Programmable Protein Array (NAPPA), constructed by spotting plasmid DNA to effectively express and encode the proteins on the array, which has been used for immunoprofiling [177,178]. Furthermore, functional peptide arrays have also been constructed [179,180]. Complementary technologies such as bead-based immunoassays are also being actively developed, such as the Luminex xMAP assay [181].
• Microbiomes: Omics profiling could also include mapping of the personal microbiome, the complete set of microbes in an individual (e.g., found mainly on the skin or in the gut, conjunctiva, saliva and mucosa), possibly using a combined omics approach to look at genetic makeup and metabolic components [182-187]. The human microbiota (http://www.human-microbiome.org) have been associated with obesity [188] and diabetes [189,190] and have also been suspected to play an active role in the development of immunity [191]. The dynamic monitoring of microbiome-related changes can help identify the specific microbiota involved in disease responses, elucidate microbiome-host interactions and show how individual variability in components impacts developmental and metabolic processes.
• Methylomes: In addition to genomics, epigenomic information, such as probing the methylome, i.e., identifying all genomic sites of cytosine methylation [192,193], might provide information about differentiation and the regulation of gene expression. Methylation analysis and data interpretation can be challenging [194,195], but methods are improving as more data become available. Methylome analysis has now been carried out in blood components [196], stem cells [197] and ovarian cancer [61], and it might prove invaluable in assessing epigenomic effects on individual development and health.
PERSONALIZED MEDICINE
The developments of the many different omics technologies outlined above have given us tremendous insight into the human genome and its associations to diseases, especially with the rise of the personal genome. The NHGRI, recognizing the importance of these developments and the directions necessary to enhance health care, outlined in 2011 a vision for the future of personalized medicine [198] encompassing five domains of development: understanding the structure of genomes, understanding their biology, improving our understanding of the biology of disease, advancing medicine and improving the effectiveness of healthcare. The aims had been set for a shift towards personalized medicine within two decades, but the availability of the technology and constantly decreasing costs have made pilot investigations of personalized medicine a current possibility [73]. Genetic variation has proven adequate for understanding group differences in disorders, but a truly personalized implementation needs to consider the individual. Clinicians are already considering molecular markers in their evaluation of patients, particularly in cancer [199-203]. The typical clinical diagnosis involves the observation of symptoms, traditionally confirmed using a small set of molecular markers. In diseases that share a common set of symptoms, some of them rare, such diagnosis is often complicated and prolonged, especially for heterogeneous disorders that need additional information to enable classification and subsequent specific treatments. Genetic and environmental factors create additional variability in disease severity, progression and treatment responses. Thus, traditional assays together with the aforementioned omics technologies, which allow the monitoring of thousands of molecular components, will facilitate and accelerate differential diagnostics and sub-classification by utilizing a more complete set of disease markers. A personalized approach will result in better targeting of diseases and introduce higher precision through the measurement of larger sets of molecular components, and would ideally be implemented at an early age to assess disease risk and have a preventive rather than retrospective treatment focus.
A personal approach is by its nature an n = 1 study, which helps eliminate variation between individuals that are treated as a group, but still requires some verification and the establishment of a baseline for comparison. As such, the profiling of healthy physiological states in a longitudinal approach may provide such a basis, if multiple time points with similar physiological state makeup are sampled. Multiple omics can supply multiple supporting datasets at each time point, with each complementary technology providing additional supporting information for establishing a baseline. This introduces the concept of complete omics monitoring of individuals over time, making personalized medicine a more dynamic proposition. The dynamic changes of molecular components may be associated with the individual's changing physiological states and mapped onto pathways to identify the onset and progression of disease, including possible preventive measures. In our suggested implementation, termed integrative Personal Omics Profiling (iPOP), which we followed in the study discussed below [73], we integrate the omics components discussed above in a longitudinal approach with three essential steps (Figure 5):
I) Risk estimation: As discussed above, the personal and common genomic variants determined in an individual genome can be associated with disease [76], with pharmacogenomic evaluation to determine possible drug responses. Whole genome sequencing at an early age, possibly at birth, can provide a list of disorders with possibly increased risk and lead to the taking of preventive measures. This may be done in combination with a complete medical and family history, as for example implemented in the PGP, and in conjunction with classical clinical risk factor profiling.
II) Dynamic profiling of multiple omics: Starting with a healthy or 'steady state' baseline and monitoring changes in the molecular components over multiple time points, drastic or gradual changes in physiological states might be assessed and the dynamic onset of disease profiled, and possibly prevented. Such profiling may be done on blood components, which are easily obtainable in the clinic. Individual blood components are excellent reflectors of the generalized physiological state of an individual, as the blood circulates and receives inputs from multiple tissues throughout the body. The components may be processed to track multiple omics, such as the transcriptome, proteome, metabolome and autoantibodyome, which, as mentioned, offer complementary information, especially given the modest correlation observed between transcriptomic and proteomic components [137-142]. A recent study of tumor profiles changing over time also employed an integrative approach on genomic and transcriptomic components [204]. Implementing this monitoring on healthy individuals will allow the observation of disease onset and of physiological changes across healthy, disease and recovery states, following thousands of molecular component levels and responses at the corresponding physiological states.
III) Data integration and biological impact assessment: The multiple omics data can be analyzed individually to characterize their temporal response profiles. This may be done using standard statistical time-series analysis, extensively used in all quantitative disciplines, such as physics, economics and finance, as discussed by Bar-Joseph et al. [205]. The dynamic signature of the signals for each molecular component can be studied for autocorrelation, periodicity or spiky behavior, corresponding to causal changes or abnormal physiological conditions such as the onset of disease, infections or environmental effects (a minimal sketch of such an analysis follows below). The different classes of temporal response can be checked for biological pathway and gene ontology enrichment [151,157-161,206-210], and for corresponding disease associations in comparison to a database of other longitudinal profiles (coupled to complete electronic records of omic and medical histories). Such a database is a necessary and powerful resource towards the realization of personalized medicine based on omics data profiling.
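The following is a minimal sketch of step III for a single molecular component on synthetic data: deviations from a baseline built on earlier 'healthy' time points flag a possible physiological state change, and the autocorrelation summarizes the temporal signature. The actual iPOP analysis [73] used a more elaborate spectral framework than this.

```python
# Hedged sketch: flag time points deviating from an individual's baseline and
# summarize the temporal signature via autocorrelation. Data are synthetic.
import numpy as np

levels = np.array([1.0, 1.1, 0.9, 1.0, 1.2, 0.95, 1.05, 2.4, 2.6, 2.5])

# baseline from the first seven ("healthy") time points
z = (levels - levels[:7].mean()) / levels[:7].std(ddof=1)
print("time points deviating from baseline:", np.where(np.abs(z) > 3)[0])
# -> [7 8 9], the spike suggesting a physiological state change

# lag autocorrelation of the mean-centered series
centered = levels - levels.mean()
acf = np.correlate(centered, centered, mode="full")[len(levels) - 1:]
acf /= acf[0]
print("lag-1 autocorrelation:", round(acf[1], 2))
```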
Example implementation of personalized medicine: iPOP
To show the feasibility and practical applicability of iPOP we profiled a healthy 54-year-old individual over a period of initially 14 (now 33) months [73]. This initial time series covered healthy states and two viral states, including a human rhinovirus (HRV) infection at the initiation of the study and a respiratory syncytial virus (RSV) infection 289 days later. The iPOP used blood samples to extract omic components from peripheral blood mononuclear cells (PBMCs) and serum, which were analyzed to obtain a complete DNA, RNA, protein, metabolite and autoantibody profile. Initially a complete medical exam with standard clinical tests was performed before time-point profiling began. In a first step, WGS was carried out with two platforms (Complete Genomics and Illumina, at 150- and 120-fold coverage respectively) and WES with three platforms (Nimblegen, Illumina and Agilent), which helped identify a large number of variants (>3 × 10⁶ SNPs; >2 × 10⁵ indels; >2000 SVs). Using multiple platforms allowed us to determine high-confidence and novel variants (using HugeSeq [211]). Evaluation of genetic disease risks based on variants was carried out, both by looking for known disease associations using dbSNP and the Online Mendelian Inheritance in Man (OMIM, http://omim.org/) database, and by using the RiskOGram algorithm [76], which integrates information from multiple alleles to assess risk against a similarly matched data cohort. This revealed significantly increased risk for various disorders, including open angle glaucoma, dyslipidemia, coronary artery disease, basal cell carcinoma, type 2 diabetes (T2D), age related macular degeneration and psoriasis. This encouraged the subject to follow up on these disorders and to start monitoring glucose and glycated hemoglobin (HbA1c) levels, which surprisingly increased beyond normal levels following the RSV infection, and the subject was diagnosed with T2D by his physician 369 days into the study. Related to T2D, pharmacogenomic considerations revealed a possibly favorable (glucose lowering) response to the diabetic drugs rosiglitazone and metformin, should treatment become necessary. Furthermore, autoantibodyome profiling of the subject (Invitrogen ProtoArrays profiling 9483 protein reactivities to Immunoglobulin G (IgG)) revealed increased reactivity for multiple proteins, including DOK6 (related to insulin receptors), as well as GOSR1, BTK and ASPA, previously reported by Winer et al. to show high reactivity in insulin resistant patients [176]. The subject initiated and still maintains a strict dietary and exercise regimen supplemented with low doses of acetylsalicylic acid, which helped him control his glucose and HbA1c levels; after a considerable time period (~months) these have now returned to normal levels.
In addition, a range of omics was profiled over time, for up to 20 different time points over the span of the study, including a high-coverage transcriptome (RNA-Seq of PBMCs; 2.67 billion reads mapped to 19,714 isoforms corresponding to 12,659 genes), proteome (MS of PBMCs, identifying a total of 6280 proteins, 3731 of them consistently across most time points) and metabolome (MS of serum, profiling 6862 and 4228 metabolites during the periods of HRV and RSV infection respectively, with ~20% identified based on mass and retention times alone). The dynamic transcriptome, proteome and metabolome profiles were analyzed in a novel integrated framework based on spectral analysis of the time series. This allowed the identification of temporal patterns in the combined data, corresponding to biological processes that varied with physiological state changes, including the onset of T2D seen in multiple omics components, and common signatures of the HRV and RSV infections. While several gene associations to pathways were known, multiple genes showed similar patterns that had not been reported before and merit further investigation.
OTHER CONSIDERATIONS AND FUTURE DIRECTIONS
The iPOP study discussed above revealed the complexities and characteristics of personal genomes, transcriptomes, proteomes and metabolomes, and showed the feasibility of personalized longitudinal profiling that can provide actionable health information. Multiple omics data integration still presents a formidable challenge and merits further development. Each omics technology produces different kinds of data, in multiple formats (e.g., data files range from simple text and extensible markup, e.g., .xml, to vendor closed-source formats). Additionally, each omics set requires its own quality control analysis, further confounded by the different error and noise levels associated with the different technologies. As each of the data sets also presents different signal and noise distributions, uniform normalization approaches across omics are challenging, especially when considering multimodal dynamic data. Furthermore, the amount of information per omics set can vary, e.g., ~5000 proteins, ~20,000 transcript isoforms, ~6000-10,000 metabolites, ~9000 autoantibody-protein reactivities and so forth. Hence, gene-centric approaches, which integrate data corresponding to, associated with or interacting with the same genes, will not always work, as the different components may not match. The integration of information per component is made more difficult by the multiple existing gene and protein annotations, often resulting in a many-to-many map in the gene-protein integration, and by correspondingly lacking metabolite-protein/gene annotations and associations. Finally, when considering dynamic datasets, there are also multiple instances where time points may be missing data for some of the molecular components (especially evident in mass spectrometry and shotgun proteomics, where proteins are identified through different peptides). These complications of omics data integration necessitate that each individual omics data set be analyzed independently up to normalization, and then integrated with the other information (a minimal sketch of this idea follows below). New integrative methodology has to account for such different normalizations and missing data, and allow integration that is not gene-based but rather incorporates time-series analyses, as was carried out, for example, in the iPOP study [73]. Classification of changes by temporal response, and possibly interaction data, leads to an interpretation of components based on shared similar dynamics and avoids some of the issues of insufficient annotations and missing information. Such an interpretation lends itself to a clinical setting where dynamic changes are associated with varying personalized physiological states, and may be adopted by the medical community.
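As a minimal illustration of this independent-normalization-then-integration strategy, the sketch below z-scores each (fabricated) omics series separately, masks missing time points, and compares temporal shapes across omics; the component names are hypothetical and real analyses involve far more careful per-platform normalization.

```python
# Illustrative integration step: normalize each omics dataset independently
# (z-scores per component across time points), then compare temporal shapes
# so components from different omics can be grouped by shared dynamics.
import numpy as np

omics = {
    "transcriptome/GENE_X": [5.1, 5.0, 8.9, 8.7, 5.2],
    "proteome/PROT_X":      [2.0, None, 3.9, 4.1, 2.1],   # one missing time point
    "metabolome/MET_Y":     [0.8, 0.9, 0.2, 0.3, 0.85],
}

normalized = {}
for name, series in omics.items():
    arr = np.array([np.nan if v is None else v for v in series], dtype=float)
    normalized[name] = (arr - np.nanmean(arr)) / np.nanstd(arr)  # NaNs preserved

# strongly correlated (or anti-correlated) z-profiles suggest a shared response
a, b = normalized["transcriptome/GENE_X"], normalized["metabolome/MET_Y"]
print("correlation GENE_X vs MET_Y:", round(np.corrcoef(a, b)[0, 1], 2))  # ~ -0.99
```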
To facilitate the wide adoption of these methods in personalized medicine, the integrated data analysis will require optimization of current computational tools to rapidly and efficiently handle, as well as visualize, the multiple omics data. As a first step, the computation time for different analyses must be reduced from days (in the case of mapping sequence data and quantitative proteomics in the omics analyses presented above) to hours or less to have immediate relevance to active medical examinations. Secondly, better visualizations of omics data, though difficult, are also necessary, as multidimensional information is difficult to collate, present and interpret (many efforts are addressing this; e.g., Circos plots, which allow multiple levels of sequence information to be displayed together, are now widely adopted [212]). Incorporating such information with clinical data and phenotypes presents a new challenge, requiring browsers that combine temporal information with multi-dimensional omics sets. We believe network analysis [213-217] presents an excellent visualization and integration possibility (see the sketch below), allowing combinations of multiple levels of dynamically changing networks that include cellular information, component and corresponding disease temporal progressions, as well as medical assay data, in a modularized approach. The computational analysis and visualization of integrated omics data also highlight the well-known need to manage large amounts of data [218,219], in terms of processing power as well as storage capacity and easy accessibility, especially for the practicing clinician, with the recent advent of cloud computing providing one possible solution. Finally, the combination of omics data with medical records presents another challenge, with privacy and ethical issues that must be considered. Such improvements and the standardization of approaches will help make the analysis available in clinical settings and to an increasingly larger set of patients, while encouraging the early adoption of integrated approaches by the scientific community towards personalized medicine applications.
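As a sketch of the network view suggested here, the snippet below uses the networkx library to build a small multi-omics graph in which nodes are molecular components tagged by omics type and weighted edges link components with similar (or opposite) temporal profiles; all names and weights are invented for illustration.

```python
# Toy multi-omics network: nodes = molecular components annotated by omics
# type; edge weights = (invented) correlations between temporal profiles.
import networkx as nx

G = nx.Graph()
G.add_node("GENE_X", omics="transcriptome")
G.add_node("PROT_X", omics="proteome")
G.add_node("MET_Y", omics="metabolome")
G.add_edge("GENE_X", "PROT_X", weight=0.9)   # correlated dynamics
G.add_edge("GENE_X", "MET_Y", weight=-0.8)   # anti-correlated dynamics

for node, data in G.nodes(data=True):
    print(node, data["omics"], "degree:", G.degree(node))
```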
As technology improves we expect to see advancements in each omics implementation discussed above. In terms of sequencing, continual improvements in depth and read length will allow unambiguous, precise sequence mapping and, additionally, the querying of lower gene expression levels, coupled with higher accuracy in variant calling. With sequencing times becoming faster (e.g., whole genome sequencing in ~5-30 hours, depending on platform, at deep ~100× coverage) and hardware more compact, such technology will eventually be available in the clinic, enabling the incorporation of genomic, transcriptomic, microbiomic and autoantibodyomic profiling as part of regular medical examinations. Correspondingly, mass spectrometry improvements (including the table-top hardware now available) will bring better mass accuracy and higher sensitivity, allowing increases in the number of proteins identified and better quantitation, which can already be implemented in a clinical setting. The MS improvements, in combination with better metabolite cataloguing, will also improve the identification of small molecules. Protocol and methodology advancements will reduce the volume of patient sample needed for iPOP (from ~80 mL to drops of blood), making it feasible to probe the omics on a more regular basis for each patient, even through home kits for sending in self-collected samples (akin to what is already implemented to some degree by companies, e.g., 23andMe, that collect saliva samples for genotyping).
The technological and methodological advancements will allow for effective iPOP implementations with multiple patients, but it will still take some time to evaluate what constitutes actionable information and which components will be most informative. Once these relevant components are identified, monitoring technologies can be further developed to support possible clinical implementations. This will certainly be aided by multiple iPOP studies providing the necessary aggregated information. However, clinical and psychological concerns need to be addressed, with the possible impact on patient health being of paramount importance, in a medical process in which the patient is actively participating [220]. Such active participation requires training the public and health professionals towards an understanding of genomic information and of how this omics knowledge impacts their health and that of their families. Genetic counseling is a necessity, and the number of trained genetic counselors is steadily increasing. Informed consent will be necessary, but this requires an understanding of basic genomic terms that is not apparent in non-experts. To facilitate this, school curriculum adjustments will probably be needed to enable early education of the public.
The emergence of quantitative Personal Omics, including genomes, transcriptomes, proteomes, metabolomes and other omics, allows us to combine them to yield personalized, actionable health care information. Such research is at the forefront of medical science and may help the characterization of disorders and the implementation of precise personal medicine aimed towards prevention rather than treatment. Careful forward planning, coupled to the continuing interest and participation of the public, government agencies and researchers, will ensure that the development of personalized omics proceeds beyond possible hurdles into a novel approach to 21st century health care.
ACKNOWLEDGEMENTS
We would like to thank the Stanford Genetics Department and the NIH for support through grant P50HG02357. GIM would also like to thank the NIH for support through training grant T32HG000044. We also thank Drs. Rui Chen, Jennifer Li Pook Than and Hogune Im for useful discussions.