Background: A novel coronavirus (SARS-CoV-2) was identified in January 2020 as the causal pathogen of COVID-19, a pandemic that started near the end of 2019. The angiotensin-converting enzyme 2 (ACE2) protein, which SARS-CoV uses as a receptor, was found to facilitate infection by SARS-CoV-2, which is initiated by the binding of the spike protein to human ACE2.
Methods: Using homology modeling and molecular dynamics (MD) simulation methods, we report here the detailed structure and dynamics of the ACE2 in complex with the receptor binding domain (RBD) of the SARS-CoV-2 spike protein.
Results: The predicted model is highly consistent with the experimentally determined structures, validating the homology modeling results. Besides the binding interface reported in the crystal structures, novel binding poses are revealed from all-atom MD simulations. The simulation data are used to identify critical residues at the complex interface and provide more details about the interactions between the SARS-CoV-2 RBD and human ACE2.
Conclusion: Simulations reveal that the RBD binds to both the open and closed states of ACE2. Two human ACE2 mutants and rat ACE2 are modeled to study the effects of mutations on RBD binding to ACE2. The simulations show that the N-terminal helix and K353 are very important for the tight binding of the complex, and the mutants are found to alter the binding modes of the CoV2-RBD to ACE2.
In May 1985, an influential meeting was held at the University of California, Santa Cruz; it was the first serious discussion of sequencing the entire human genome. The author was one of the participants and describes the meeting and related issues.
Background: Polygenic risk score (PRS) derived from summary statistics of genome-wide association studies (GWAS) is a useful tool to infer an individual’s genetic risk for health outcomes and has gained increasing popularity in human genetics research. PRS in its simplest form enjoys both computational efficiency and easy accessibility, yet the predictive performance of PRS remains moderate for diseases and traits.
Results: We provide an overview of recent advances in statistical methods to improve PRS’s performance by incorporating information from linkage disequilibrium, functional annotation, and pleiotropy. We also introduce model validation methods that fine-tune PRS using GWAS summary statistics.
Conclusion: In this review, we showcase methodological advances and current limitations of PRS, and discuss several emerging issues in risk prediction research.
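As a minimal sketch of the simplest form of PRS mentioned in the Background, the score can be computed as a weighted sum of allele dosages, with GWAS effect-size estimates as the weights. The function name, the toy effect sizes, and the toy dosage matrix below are illustrative assumptions and are not taken from any specific method covered in the review.

```python
import numpy as np

def polygenic_risk_score(dosages, betas):
    """Return one score per individual: sum over variants of dosage * effect size."""
    dosages = np.asarray(dosages, dtype=float)  # shape: (individuals, variants)
    betas = np.asarray(betas, dtype=float)      # shape: (variants,)
    if dosages.ndim != 2 or dosages.shape[1] != betas.shape[0]:
        raise ValueError("expected one effect size per variant column")
    return dosages @ betas

# Hypothetical toy data: 3 individuals, 4 variants, dosages coded 0/1/2.
toy_betas = [0.10, -0.05, 0.20, 0.00]
toy_dosages = [
    [0, 1, 2, 1],
    [2, 0, 0, 2],
    [1, 1, 1, 0],
]
scores = polygenic_risk_score(toy_dosages, toy_betas)
```

This unadjusted sum ignores linkage disequilibrium between variants, which is one of the limitations that the methods surveyed in the review aim to address.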
Background: Microfluidic systems have advantages such as a high throughput, small reaction volume, and precise control of the cellular position and environment. These advantages have allowed microfluidics to be widely used in several fields of synthetic biology in recent years.
Results: In this article, we reviewed the microfluidic-based methods for synthetic biology from two aspects: the construction of synthetic gene circuits and the analysis of synthetic gene systems. We used some examples to illustrate the progress and challenges in the steps of synthetic gene circuit construction and the approaches to gene expression analysis with microfluidic systems.
Conclusion: Compared with traditional methods, microfluidic tools promise great advantages in building and analyzing synthetic genetic circuits. Moreover, new microfluidic systems, together with mathematical modeling of synthetic circuits or consortia, are desirable for constructing complex genetic circuits and for better understanding natural gene regulation in cells and population interactions.
Background: The direct-to-consumer genetic testing (DTC-GT) industry has exploded in recent years, initiated by market pioneers from the United States and quickly followed by companies from Europe and Asia. In addition to their primary objective of providing ancestry and health information to customers, DTC-GT services have emerged as a valuable data resource for large-scale population and genetics studies.
Methods: We assessed DTC-GT market leaders in the U.S. and China, user participation in research, and academic reports based on this information. We also investigated DTC-GT end-user value by tracing key updates that companies provided via health risk reports and evaluating their predictive power. We then assessed the replicability of several genome-wide association studies (GWAS) based on a Chinese DTC-GT biobank.
Results: As recent entrants to the market, Chinese DTC-GT service providers have published less academic research than their Western counterparts; however, a larger proportion of Chinese users consent to participate in research projects. Dramatic increases in user volume and resultant report updates led to reclassification of some users’ polygenic risk levels, but within a reasonable scale and with increased predictive power. Replicability among GWAS using the Chinese DTC-GT biobank varied by studied trait, population background, and sample size.
Conclusions: We speculate that the rapid growth in DTC-GT services, particularly in non-Caucasian populations, will yield an important and much-needed resource for biobanking, large-scale genetic studies, clinical trials, and post-clinical applications.
Background: Genome-wide association studies (GWAS) have succeeded in identifying tens of thousands of genetic variants associated with complex human traits during the past decade; however, they are still hampered by limited statistical power and difficulties in biological interpretation. With the recent progress in expression quantitative trait loci (eQTL) studies, transcriptome-wide association studies (TWAS) provide a framework to test for gene-trait associations by integrating information from GWAS and eQTL studies.
Results: In this review, we will introduce the general framework of TWAS, the relevant resources, and the computational tools. Extensions of the original TWAS methods will also be discussed. Furthermore, we will briefly introduce methods that are closely related to TWAS, including MR-based methods and colocalization approaches. The connections and differences between these approaches will be discussed.
Conclusion: Finally, we will summarize strengths, limitations, and potential directions for TWAS.
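To make the integration of GWAS and eQTL information concrete, the sketch below computes one commonly used summary-statistics form of the TWAS test statistic, z_TWAS = w'z / sqrt(w'Rw), where w holds expression prediction weights learned from an eQTL reference panel, z holds GWAS z-scores for the same variants, and R is their LD correlation matrix. The function name and all numbers are illustrative assumptions; the review covers several methods that differ from this basic form.

```python
import numpy as np

def twas_z_score(weights, gwas_z, ld):
    """Gene-level z-score from expression weights, GWAS z-scores, and an LD matrix."""
    w = np.asarray(weights, dtype=float)  # expression prediction weights, one per variant
    z = np.asarray(gwas_z, dtype=float)   # GWAS z-scores for the same variants
    r = np.asarray(ld, dtype=float)       # LD (correlation) matrix among the variants
    variance = w @ r @ w                  # variance of the weighted sum under the null
    if variance <= 0:
        raise ValueError("weights give zero predicted-expression variance")
    return float(w @ z / np.sqrt(variance))

# Hypothetical example: two independent variants with equal weights.
example_z = twas_z_score(weights=[1.0, 1.0],
                         gwas_z=[2.0, 2.0],
                         ld=[[1.0, 0.0], [0.0, 1.0]])
```

With a single nonzero weight the statistic reduces to that variant's GWAS z-score, which is a quick sanity check on any implementation.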
Background: The COVID-19 pandemic has become a formidable threat to global health and economy. The coronavirus SARS-CoV-2 that causes COVID-19 is known to spread by human-to-human transmission, and in about 40% of cases, the exposed individuals are asymptomatic, which makes it difficult to contain the virus.
Methods: This paper presents a modified SEIR epidemiological model and uses concepts of optimal control theory to analyze the effects of intervention methods for COVID-19. Fundamentally, the pandemic intervention problem can be viewed as a mathematical optimization problem, because the outcomes conflict: reduced infections and fatalities come with serious economic downturns.
Results: Concepts of optimal control theory have been used to determine the optimal control (intervention) levels of i) social contact mitigation and suppression, and ii) pharmaceutical intervention modalities, with minimum impacts on the economy. Numerical results show that with optimal intervention policies, there is a significant reduction in the number of infections and fatalities. The computed optimum intervention policy also provides a timeline of systematic enforcement and relaxation of stay-at-home regulations, and an estimate of the peak time and number of hospitalized critical care patients.
Conclusion: The proposed method could be used by local and state governments in planning effective strategies in combating the pandemic. The optimum intervention policy provides the necessary lead time to establish necessary field hospitals before getting overwhelmed by new patient arrivals. Our results also allow the local and state governments to relax social contact suppression guidelines in an orderly manner without triggering a second wave.
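The effect of an intervention level on an epidemic curve can be illustrated with a minimal sketch of a standard SEIR model in which a constant control u (between 0 and 1) scales down the contact rate. This is not the modified SEIR model or the optimal control solution from the paper: the function name, the parameter values, and the use of a constant rather than an optimized, time-varying control are all assumptions made for illustration.

```python
def simulate_seir(beta, sigma, gamma, u=0.0, n=1_000_000, i0=10, days=300, dt=0.1):
    """Forward-Euler integration of S, E, I, R with contact rate (1 - u) * beta."""
    s, e, i, r = float(n - i0), 0.0, float(i0), 0.0
    history = [(s, e, i, r)]
    for _ in range(round(days / dt)):
        new_exposed = (1.0 - u) * beta * s * i / n  # S -> E flow, reduced by control u
        new_infectious = sigma * e                  # E -> I flow
        new_removed = gamma * i                     # I -> R flow
        s -= dt * new_exposed
        e += dt * (new_exposed - new_infectious)
        i += dt * (new_infectious - new_removed)
        r += dt * new_removed
        history.append((s, e, i, r))
    return history

# Hypothetical parameters: 5-day incubation, 10-day infectious period.
baseline = simulate_seir(beta=0.5, sigma=0.2, gamma=0.1, u=0.0)
controlled = simulate_seir(beta=0.5, sigma=0.2, gamma=0.1, u=0.5)
peak_baseline = max(state[2] for state in baseline)
peak_controlled = max(state[2] for state in controlled)
```

An optimal control formulation would replace the constant u with a time-varying function chosen to minimize a cost that weighs infections against the economic burden of the intervention.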
Background: Histone modifications are major factors that define chromatin states and have functions in regulating gene expression in eukaryotic cells. The chromatin immunoprecipitation coupled with high-throughput sequencing (ChIP-seq) technique has been widely used for profiling the genome-wide distribution of chromatin-associating protein factors. Some histone modifications, such as H3K27me3 and H3K9me3, usually mark broad domains in the genome ranging from kilobases (kb) to megabases (Mb) long, resulting in diffuse patterns in the ChIP-seq data that are challenging for signal separation. While most existing ChIP-seq peak-calling algorithms are based on local statistical models that do not account for multi-scale features, a principled method to identify scale-free broad domains has been lacking.
Methods: Here we present RECOGNICER (Recursive coarse-graining identification for ChIP-seq enriched regions), a computational method for identifying ChIP-seq enriched domains on a large range of scales. The algorithm is based on a coarse-graining approach, which uses recursive block transformations to determine spatial clustering of local enriched elements across multiple length scales.
Results: We apply RECOGNICER to call H3K27me3 domains from ChIP-seq data, and validate the results based on H3K27me3’s association with repressive gene expression. We show that RECOGNICER outperforms existing ChIP-seq broad domain calling tools in identifying more whole domains than separated pieces.
Conclusion: RECOGNICER can be a useful bioinformatics tool for next-generation sequencing data analysis in epigenomics research.
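The idea of a recursive block transformation can be illustrated with a toy sketch that coarse-grains a binary track of locally enriched bins using a majority rule. This is not the RECOGNICER algorithm, whose details are not given in the abstract; the majority rule, the block size of 3, the function names, and the toy track are all assumptions chosen only to show how clustered enrichment survives to coarser scales while isolated bins do not.

```python
def coarse_grain(bins, block=3):
    """One block transformation: a block is enriched if most of its bins are."""
    coarse = []
    for start in range(0, len(bins), block):
        chunk = bins[start:start + block]
        coarse.append(1 if 2 * sum(chunk) > len(chunk) else 0)
    return coarse

def coarse_grain_levels(bins, block=3):
    """Apply the block transformation repeatedly until a single block remains."""
    if block < 2:
        raise ValueError("block size must be at least 2")
    levels = [list(bins)]
    while len(levels[-1]) > 1:
        levels.append(coarse_grain(levels[-1], block))
    return levels

# Hypothetical track of 27 bins: a dense cluster on the left, isolated bins elsewhere.
toy_track = [1, 1, 0, 1, 0, 1, 1, 1, 1,
             0, 0, 0, 0, 1, 0, 0, 0, 0,
             1, 0, 0, 0, 0, 0, 0, 0, 0]
levels = coarse_grain_levels(toy_track)
```

In this toy track the left-hand cluster remains enriched after two transformations, whereas the two isolated enriched bins disappear after the first one.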
Background: Genome-wide association studies (GWASs) have identified thousands of genetic variants that are associated with many complex traits. However, their biological mechanisms remain largely unknown. Transcriptome-wide association studies (TWAS) have been recently proposed as an invaluable tool for investigating the potential gene regulatory mechanisms underlying variant-trait associations. Specifically, TWAS integrate GWAS with expression mapping studies based on a common set of variants and aim to identify genes whose genetically regulated expression (GReX) is associated with the phenotype. Various methods have been developed for performing TWAS and/or similar integrative analyses. Each such method has a different modeling assumption, and many were initially developed to answer different biological questions. Consequently, it is not straightforward to understand their modeling properties from a theoretical perspective.
Results: We present a technical review on thirteen TWAS methods. Importantly, we show that these methods can all be viewed as two-sample Mendelian randomization (MR) analysis, which has been widely applied in GWASs for examining the causal effects of exposure on outcome. Viewing different TWAS methods from an MR perspective provides us a unique angle for understanding their benefits and pitfalls. We systematically introduce the MR analysis framework, explain how features of the GWAS and expression data influence the adaptation of MR for TWAS, and re-interpret the modeling assumptions made in different TWAS methods from an MR angle. We finally describe future directions for TWAS methodology development.
Conclusions: We hope that this review will serve as a useful reference for both methodologists who develop TWAS methods and practitioners who perform TWAS analysis.
Background: The lipostatic set-point theory, ascribing fat mass homeostasis to leptin mediated central feedback regulation targeting the body’s fat storage, has caused a variety of conundrums. We recently proposed a leanocentric locking-point theory and the corresponding mathematical model, which not only resolve these conundrums but also provide valuable insights into weight control and health assessment. This paper aims to further test the leanocentric theory.
Methods: Partial lipectomy is a touchstone to test both the leanocentric and lipostatic theories. Here we perform in silico lipectomy by using a mathematical model embodying the leanocentric theory to simulate the long-term body fat change after removing some fat cells in the body.
Results: The mathematical modeling uncovers a phenomenon called post-surgical fat loss, which has been well documented in real partial lipectomy surgeries; thus, the phenomenon can serve as empirical support for the leanocentric theory. On the other hand, the leanocentric theory, but not the lipostatic theory, can readily explain the post-surgical fat loss.
Conclusions: The leanocentric locking-point theory is a promising theory and deserves further testing. Partial lipectomy surgeries are beneficial to obese patients for quite a long period.
Background: Identifying patient-specific flow of signal transduction perturbed by multiple single-nucleotide alterations is critical for improving patient outcomes in cancer cases. However, accurate estimation of mutational effects at the pathway level for such patients remains an open problem. While probabilistic pathway topology methods are gaining interest among the scientific community, the overwhelming majority do not account for network perturbation effects from multiple single-nucleotide alterations.
Methods: Here we present an improvement of the mutational forks formalism to infer the patient-specific flow of signal transduction based on multiple single-nucleotide alterations, including non-synonymous and synonymous mutations. The lung adenocarcinoma and skin cutaneous melanoma datasets from TCGA Pan-Cancer Atlas have been employed to show the utility of the proposed method.
Results: We have comprehensively characterized six mutational forks. The number of mutated nodes ranged from one to four depending on the topological characteristics of a fork. Transitional confidences (TCs) have been computed for every possible combination of single-nucleotide alterations in the fork. The performed analysis demonstrated the capacity of the mutational forks formalism to follow a biologically explainable logic in the identification of high-likelihood signaling routes in lung adenocarcinoma and skin cutaneous melanoma patients. The findings have been largely supported by the evidence from the biomedical literature.
Conclusion: We conclude that the formalism has a great chance to enable an assessment of patient-specific flow by leveraging information from multiple single-nucleotide alterations to adjust the transitional likelihoods that are solely based on the canonical view of a disease.
Background: The extremely small amount of DNA in a cell makes it difficult to study the whole genome of single cells, so whole-genome amplification (WGA) is necessary to increase the DNA amount and enable downstream analyses. Multiple displacement amplification (MDA) is the most widely used WGA technique.
Results: Compared with amplification methods based on PCR and other methods, MDA renders high-quality DNA products and better genome coverage by using phi29 DNA polymerase. Moreover, recently developed advanced MDA technologies such as microreactor MDA, emulsion MDA, and micro-channel MDA have improved amplification uniformity. Additionally, the development of other novel methods such as TruePrime WGA allows for amplification without primers.
Conclusion: Here, we reviewed a selection of recently developed MDA methods, their advantages over other WGA methods, and improved MDA-based technologies, followed by a discussion of future perspectives. With the continuous development of MDA and the successive update of detection technologies, MDA will be applied in increasingly more fields and provide a solid foundation for scientific research.
COVID-19 is now rapidly spreading worldwide. While the majority of COVID-19 patients show only mild or moderate symptoms, some deteriorate quickly and may die suddenly. It is therefore important to identify who is more likely to develop severe outcomes so that they can receive particular or preventive care. In this literature survey, we collected epidemiologic and clinical data from 36 articles on 51,270 patients with COVID-19 of differing severity, aiming to characterize the population that is prone to severe conditions and bad outcomes. These data reveal that older males and those with high BMI or underlying diseases, especially cardiovascular disease, hypertension and diabetes, are overrepresented among severe cases. High leukocyte counts and lymphopenia are common features in severe and critical patients. Upon deterioration of the disease, both CD4+ and CD8+ T cells decrease, while almost all serum cytokines, especially pro-inflammatory cytokines, increase.
Background: Mendelian randomization (MR) analysis has become popular for inferring and estimating the causal effect of an exposure on an outcome, owing to the success of genome-wide association studies. Many statistical approaches have been developed, and each of these methods requires specific assumptions.
Results: In this article, we review the pros and cons of these methods. We use an example of high-density lipoprotein cholesterol on coronary artery disease to illuminate the challenges in Mendelian randomization investigation.
Conclusion: The currently available MR approaches allow us to study causality among risk factors and outcomes. However, novel approaches are desirable for overcoming multiple-source confounding of risk factors and an outcome in MR analysis.
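As a minimal sketch of how a summary-data MR estimate is formed, the code below computes per-variant Wald ratios and combines them with the fixed-effect inverse-variance weighted (IVW) estimator. It assumes independent genetic instruments, uses first-order weights that ignore uncertainty in the variant-exposure effects, and is valid only when the instrumental variable assumptions hold. The function names and the toy numbers are illustrative assumptions and do not come from the HDL cholesterol example in the article.

```python
import numpy as np

def wald_ratios(beta_exposure, beta_outcome):
    """Per-variant causal effect estimates: outcome effect / exposure effect."""
    bx = np.asarray(beta_exposure, dtype=float)
    by = np.asarray(beta_outcome, dtype=float)
    return by / bx

def ivw_estimate(beta_exposure, beta_outcome, se_outcome):
    """Fixed-effect IVW combination of Wald ratios; returns (estimate, standard error)."""
    bx = np.asarray(beta_exposure, dtype=float)
    se = np.asarray(se_outcome, dtype=float)
    weights = bx ** 2 / se ** 2  # inverse variance of each Wald ratio (first order)
    ratios = wald_ratios(beta_exposure, beta_outcome)
    estimate = np.sum(weights * ratios) / np.sum(weights)
    std_error = np.sqrt(1.0 / np.sum(weights))
    return float(estimate), float(std_error)

# Hypothetical summary statistics for three independent variants.
toy_bx = [0.10, 0.20, 0.40]
toy_by = [0.05, 0.10, 0.20]
toy_se = [0.01, 0.01, 0.01]
ivw_beta, ivw_se = ivw_estimate(toy_bx, toy_by, toy_se)
```

When the per-variant ratios disagree strongly, this is a warning sign of pleiotropy or other assumption violations, which is the situation that motivates the more robust methods discussed in the review.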
Background: Genome-wide association studies (GWAS) have identified many genetic variants associated with increased risk of Alzheimer’s disease (AD). These susceptibility loci may affect AD indirectly through a combination of physiological brain changes. Many of these neuropathologic features are detectable via magnetic resonance imaging (MRI).
Methods: In this study, we examine the effects of such brain imaging derived phenotypes (IDPs) with genetic etiology on AD, using and comparing the following methods: two-sample Mendelian randomization (2SMR), generalized summary statistics based Mendelian randomization (GSMR), transcriptome wide association studies (TWAS) and the adaptive sum of powered score (aSPU) test. These methods do not require individual-level genotypic and phenotypic data but instead can rely only on an external reference panel and GWAS summary statistics.
Results: Using publicly available GWAS datasets from the International Genomics of Alzheimer’s Project (IGAP) and UK Biobank’s (UKBB) brain imaging initiatives, we identify 35 IDPs possibly associated with AD, many of which have well established or biologically plausible links to the characteristic cognitive impairments of this neurodegenerative disease.
Conclusions: Our results highlight the increased power for detecting genetic associations achieved by multiple correlated SNP-based methods, i.e., aSPU, GSMR and TWAS, over MR methods based on independent SNPs (as instrumental variables).
Availability: Example code is available at https://github.com/kathalexknuts/ADIDP.
Background: Herpes simplex virus type 1 (HSV-1) is a ubiquitous infectious pathogen that widely affects human health. To decipher the complicated human-HSV-1 interactions, a comprehensive protein-protein interaction (PPI) network between human and HSV-1 is highly demanded.
Methods: To complement the experimental identification of human-HSV-1 PPIs, an integrative strategy to predict proteome-wide PPIs between human and HSV-1 was developed. For each human-HSV-1 protein pair, four popular PPI inference methods, including interolog mapping, the domain-domain interaction-based method, the domain-motif interaction-based method, and the machine learning-based method, were optimally implemented to generate four interaction probability scores, which were further integrated into a final probability score.
Results: As a result, a comprehensive high-confidence PPI network between human and HSV-1 was established, covering 10,432 interactions between 4,546 human proteins and 72 HSV-1 proteins. Functional and network analyses of the HSV-1 targeting proteins in the context of human interactome can recapitulate the known knowledge regarding the HSV-1 replication cycle, supporting the overall reliability of the predicted PPI network. Considering that HSV-1 infections are implicated in encephalitis and neurodegenerative diseases, we focused on exploring the biological significance of the brain-specific human-HSV-1 PPIs. In particular, the predicted interactions between HSV-1 proteins and Alzheimer’s-disease-related proteins were intensively investigated.
Conclusion: The current work can provide testable hypotheses to assist in the mechanistic understanding of the human-HSV-1 relationship and the anti-HSV-1 pharmaceutical target discovery. To make the predicted PPI network and the datasets freely accessible to the scientific community, a user-friendly database browser was released at http://www.zzdlab.com/HintHSV/index.php.
Background: Whole-exome sequencing (WES) studies have identified multiple genes enriched for de novo mutations (DNMs) in congenital heart disease (CHD) probands. However, risk gene identification based on DNMs alone remains statistically challenging due to the heterogeneous etiology of CHD and the low mutation rate in each gene.
Methods: In this manuscript, we introduce a hierarchical Bayesian framework for gene-level association test which jointly analyzes de novo and rare transmitted variants. Through integrative modeling of multiple types of genetic variants, gene-level annotations, and reference data from large population cohorts, our method accurately characterizes the expected frequencies of both de novo and transmitted variants and shows improved statistical power compared to analyses based on DNMs only.
Results: Applied to WES data of 2,645 CHD proband-parent trios, our method identified 15 significant genes, half of which are novel, leading to new insights into the genetic bases of CHD.
Conclusion: These results showcase the power of integrative analysis of transmitted and de novo variants for disease gene discovery.
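To show why analysis based on DNMs alone has limited power, the sketch below implements a simple Poisson enrichment test of the kind often used as a DNM-only baseline: under the null, the expected number of DNMs in a gene is taken as 2 x (number of trios) x (per-gene, per-chromosome mutation rate). This is not the hierarchical Bayesian framework introduced in the manuscript. The function names and the mutation rate of 1e-5 are illustrative assumptions; only the trio count of 2,645 comes from the abstract.

```python
import math

def poisson_upper_tail(observed, expected):
    """P(X >= observed) for X ~ Poisson(expected)."""
    below = sum(
        math.exp(-expected) * expected ** k / math.factorial(k)
        for k in range(observed)
    )
    return max(0.0, 1.0 - below)

def dnm_enrichment_p(observed_dnms, n_trios, mutation_rate):
    """One-sided p-value for an excess of de novo mutations in a single gene."""
    expected = 2 * n_trios * mutation_rate  # two transmitted chromosomes per proband
    return poisson_upper_tail(observed_dnms, expected)

# Hypothetical gene with an assumed mutation rate of 1e-5 per chromosome per generation.
p_one_dnm = dnm_enrichment_p(observed_dnms=1, n_trios=2645, mutation_rate=1e-5)
p_five_dnms = dnm_enrichment_p(observed_dnms=5, n_trios=2645, mutation_rate=1e-5)
```

Because the expected count per gene is far below one, a single observed DNM is not significant after exome-wide multiple-testing correction, which is why borrowing information from transmitted variants can help.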
Background: COVID-19 has been affecting the whole world critically and constantly since late December 2019. Rapidly increasing infections have raised intense worldwide attention. How to model the evolution of COVID-19 effectively and efficiently is of great significance for prevention and control.
Methods: We propose the multi-chain Fudan-CCDC model based on the original single-chain model in [Shao et al. 2020] to describe the evolution of COVID-19 in Singapore. Multi-chains can be considered as the superposition of several single chains with different characteristics. We identify the parameters of models by minimizing the penalty function.
Results: The numerical simulation results show that the multi-chain model performs well on data fitting. Although the increments are unsteady, they still fall within a range of ±30% fluctuation around the simulation results.
Conclusion: The multi-chain Fudan-CCDC model provides an effective way to detect the appearance of imported infectors and super spreaders early and to forecast a second outbreak. It can also explain the data from those countries where the single-chain model shows deviation from the data.
Background: The coronavirus disease 2019 (COVID-19) epidemic has now become a global phenomenon, and its development concerns billions of people’s lives. The development of the COVID-19 epidemic in China could be used as a reference for other countries’ control strategies.
Methods: We used a classical susceptible-infected-recovered (SIR) model to forecast the development of the COVID-19 epidemic in China by nowcasting. The linear regression analyses were employed to predict the COVID-19 epidemic’s inflexion point. Finally, we used a susceptible-exposed-infected-recovered (SEIR) model to simulate the development of the COVID-19 epidemic in China throughout 2020.
Results: Our nowcasts show that the COVID-19 transmission rate started to slow down on January 30. The linear regression analyses further show that the inflexion point of this epidemic would arrive between February 17 and 18. The final SEIR model simulation forecasted that the COVID-19 epidemic would probably infect about 82,000 people and last throughout 2020 in China. We also applied our method to USA’s and global COVID-19 data, and the nowcasts show that the outlook for the COVID-19 pandemic is not optimistic for the rest of 2020.
Conclusion: The COVID-19 epidemic’s scale in China is much smaller than the previous estimations. After strict control and prevention measures, such as city lockdown, were implemented, it took a week to slow down the COVID-19 transmission and about four weeks to really mitigate the COVID-19 prevalence in China.
Background: The Genotype-Tissue Expression (GTEx) Project has collected genetic and transcriptome profiles from a wide spectrum of tissues in nearly 1,000 deceased individuals, providing an opportunity to study the regulatory roles of genetic variants in transcriptome activities from both cross-tissue and tissue-specific perspectives. Moreover, transcriptome activities (e.g., transcript abundance and alternative splicing) can be treated as mediators between genotype and phenotype to achieve phenotypic alteration. Knowing the genotype-associated transcriptome status, researchers can better understand the biological and molecular mechanisms of genetic risk variants in complex traits.
Results: In this article, we first explore the genetic architecture of gene expression traits, and then review recent methods on quantitative trait locus (QTL) and co-expression network analysis. To further exemplify the usage of associations between genotype and transcriptome status, we briefly review methods that either directly or indirectly integrate expression/splicing QTL information in genome-wide association studies (GWASs).
Conclusions: The GTEx Project provides the largest, and a highly useful, resource for investigating the associations between genotype and transcriptome status. The integration of results from the GTEx Project and existing GWASs further advances our understanding of the roles of gene expression changes in bridging genetic variants and complex traits.
Background: High-order chromatin structure has been shown to play a vital role in gene regulation. Previously we identified two types of sequence domains, CGI (CpG island) forest and CGI prairie, which tend to spatially segregate, but to different extent in different tissues. Here we aim to further quantify the association of domain segregation with gene regulation and therefore differentiation.
Methods: By means of the published RNA-seq and Hi-C data, we identified tissue-specific genes and quantitatively investigated how their regulation is relevant to chromatin structure. Besides, two types of gene networks were constructed and the association between gene pair co-regulation and genome organization is discussed.
Results: We show that compared to forests, tissue-specific genes tend to be enriched in prairies. Highly specific genes also tend to cluster according to their functions in a relatively small number of prairies. Furthermore, tissue-specific forest-prairie contact formation was associated with the regulation of tissue-specific genes, in particular those in the prairie domains, pointing to the important role of gene positioning, in the linear DNA sequence as well as in 3D chromatin structure, in gene regulatory network formation.
Conclusion: We investigated how gene regulation is related to genome organization from the perspective of forest-prairie spatial interactions. Unlike compartments A and B, forest and prairie are identified solely based on sequence properties. Therefore, the simple and uniform framework (forest-prairie domain segregation) provided here can be utilized to further understand the chromatin structure changes as well as their underlying biological significance in different stages, such as tumorigenesis.
Background: Genome-wide association studies (GWAS) have been widely adopted in studies of human complex traits and diseases.
Results: This review surveys areas of active research: quantifying and partitioning trait heritability, fine mapping functional variants and integrative analysis, genetic risk prediction of phenotypes, and the analysis of sequencing studies that have identified millions of rare variants. Current challenges and opportunities are highlighted.
Conclusion: GWAS have fundamentally transformed the field of human complex trait genetics. Novel statistical and computational methods have expanded the scope of GWAS and have provided valuable insights on the genetic architecture underlying complex phenotypes.
Background: RNA secondary structures play a pivotal role in posttranscriptional regulation and the functions of non-coding RNAs, yet in vivo RNA secondary structures remain enigmatic. PARIS (Psoralen Analysis of RNA Interactions and Structures) is a recently developed high-throughput sequencing-based approach that enables direct capture of RNA duplex structures in vivo. However, the existence of incompatible, fuzzy pairing information obstructs the integration of PARIS data with the existing tools for reconstructing RNA secondary structure models at the single-base resolution.
Methods: We introduce IRIS, a method for predicting RNA secondary structure ensembles based on PARIS data. IRIS generates a large set of candidate RNA secondary structure models under the guidance of redistributed PARIS reads and then uses a Bayesian model to identify the optimal ensemble, according to both thermodynamic principles and PARIS data.
Results: The predicted RNA structure ensembles by IRIS have been verified based on evolutionary conservation information and consistency with other experimental RNA structural data. IRIS is implemented in Python and freely available at http://iris.zhanglab.net.
Conclusion: IRIS capitalizes upon PARIS data to improve the prediction of in vivo RNA secondary structure ensembles. We expect that IRIS will enhance the application of the PARIS technology and shed more insight on in vivo RNA secondary structures.
Background: With improvements in next-generation DNA sequencing technology, genetic data can be collected at lower cost. More machine learning techniques can be used to help with cancer analysis and diagnosis.
Methods: We developed an ensemble machine learning system named performance-weighted-voting model for cancer type classification in 6,249 samples across 14 cancer types. Our ensemble system consists of five weak classifiers (logistic regression, SVM, random forest, XGBoost and neural networks). We first used cross-validation to get the predicted results for the five classifiers. The weights of the five weak classifiers can be obtained based on their predictive performance by solving linear regression functions. The final predicted probability of the performance-weighted-voting model for a cancer type can be determined by the summation of each classifier’s weight multiplied by its predicted probability.
Results: Using the somatic mutation count of each gene as the input feature, the overall accuracy of the performance-weighted-voting model reached 71.46%, which was significantly higher than those of the five weak classifiers and two other ensemble models: the hard-voting model and the soft-voting model. In addition, by analyzing the predictive pattern of the performance-weighted-voting model, we found that in most cancer types, higher tumor mutational burden can improve overall accuracy.
Conclusion: This study has important clinical significance for identifying the origin of cancer, especially for cancers whose primary site cannot be determined. In addition, our model presents a good strategy for using ensemble systems for cancer type classification.
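The weighting and voting steps described in the Methods can be sketched as follows, under stated assumptions. The abstract says that weights come from solving linear regression functions on cross-validated predictions, and that the final probability is the sum of each classifier's weight multiplied by its predicted probability. The sketch assumes a single global weight per classifier, fitted by ordinary least squares against one-hot labels; the function names and toy data are hypothetical, and the authors' exact regression setup may differ.

```python
import numpy as np

def fit_voting_weights(cv_probs, labels):
    """Least-squares weights, one per classifier, from cross-validated probabilities.

    cv_probs: array of shape (classifiers, samples, classes)
    labels:   integer class labels of shape (samples,)
    """
    cv_probs = np.asarray(cv_probs, dtype=float)
    k, n, c = cv_probs.shape
    one_hot = np.eye(c)[np.asarray(labels)]  # (samples, classes)
    design = cv_probs.reshape(k, n * c).T    # one column per classifier
    weights, *_ = np.linalg.lstsq(design, one_hot.ravel(), rcond=None)
    return weights

def weighted_vote(probs, weights):
    """Combine classifier probabilities with the fitted weights and pick the top class."""
    probs = np.asarray(probs, dtype=float)
    combined = np.tensordot(weights, probs, axes=1)  # (samples, classes)
    return combined.argmax(axis=1), combined

# Hypothetical toy case: classifier 0 is always right, classifier 1 is uninformative.
toy_labels = np.array([0, 1, 0, 1])
toy_probs = np.array([
    [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]],
    [[0.5, 0.5], [0.5, 0.5], [0.5, 0.5], [0.5, 0.5]],
])
toy_weights = fit_voting_weights(toy_probs, toy_labels)
toy_predictions, toy_combined = weighted_vote(toy_probs, toy_weights)
```

In this toy case the regression assigns essentially all of the weight to the informative classifier, which is the behavior that distinguishes performance-weighted voting from plain soft voting with equal weights.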