The CRISPR-Cas9 system, which functions naturally as a defense mechanism in prokaryotes, has been repurposed as an RNA-guided DNA-targeting platform. It has been widely used for genome editing and transcriptome modulation, and has shown great promise for correcting mutations in human genetic diseases. Off-target effects are a critical issue for all of these applications. Here we review the current status of the target specificity of the CRISPR-Cas9 system.
The specificity of protein-DNA interactions is most commonly modeled using position weight matrices (PWMs). First introduced in 1982, PWMs have been adapted to many new types of data, and many different approaches have been developed to determine their parameters. New high-throughput technologies rapidly provide large amounts of data and offer an unprecedented opportunity to accurately determine the specificities of many transcription factors (TFs). But taking full advantage of the new data requires advanced algorithms that take into account the biophysical processes involved in generating the data. The new large datasets can also help determine when the PWM model is inadequate and must be extended to provide accurate predictions of binding sites. This article provides a general mathematical description of a PWM and how it is used to score potential binding sites, along with a brief history of the approaches that have been developed and the types of data that are used, with an emphasis on algorithms that we have developed for analyzing high-throughput datasets from several new technologies. It also describes extensions that can be added when the simple PWM model is inadequate, and further enhancements that may be necessary. It briefly describes some applications of PWMs in the discovery and modeling of binding sites.
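As a concrete illustration of the scoring described above, the sketch below builds a log-odds PWM from a hypothetical count matrix and scans a short sequence for the best-scoring site. The counts, pseudocount and uniform background are illustrative assumptions, not values from the article:

```python
import math

# Hypothetical count matrix for a length-4 motif (one list per base,
# one entry per motif position), e.g. from a set of aligned binding sites.
counts = {
    "A": [8, 0, 1, 2],
    "C": [0, 1, 7, 2],
    "G": [1, 8, 1, 2],
    "T": [1, 1, 1, 4],
}

def make_pwm(counts, pseudocount=1.0, background=0.25):
    """Convert base counts to a log2 log-odds PWM with a pseudocount."""
    n = len(counts["A"])
    totals = [sum(counts[b][i] for b in "ACGT") + 4 * pseudocount
              for i in range(n)]
    return {b: [math.log2((counts[b][i] + pseudocount) / totals[i] / background)
                for i in range(n)]
            for b in "ACGT"}

def score(pwm, site):
    """Score one candidate site: sum of per-position log-odds terms."""
    return sum(pwm[b][i] for i, b in enumerate(site))

def scan(pwm, seq):
    """Score every window of the sequence; return (best_score, position)."""
    n = len(pwm["A"])
    return max((score(pwm, seq[i:i + n]), i)
               for i in range(len(seq) - n + 1))

pwm = make_pwm(counts)
best_score, pos = scan(pwm, "TTAGCTAA")   # consensus AGCT starts at index 2
```

Scanning both strands, using position-specific backgrounds, or the dinucleotide extensions discussed in the article would build on the same per-position log-odds sum.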
Background: Many existing bioinformatics predictors are based on machine learning technology. When applying these predictors in practical studies, their predictive performance should be well understood. Different performance measures and evaluation methods are applied across studies, and even for the same performance measure, different terms, nomenclatures or notations may appear in different contexts.
Results: We carried out a review on the most commonly used performance measures and the evaluation methods for bioinformatics predictors.
Conclusions: In bioinformatics it is important to correctly understand and interpret predictive performance, as this is the key to rigorously comparing different predictors and choosing the right one.
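For reference, the most common binary-classification measures covered by such reviews can be computed directly from confusion-matrix counts. The counts below are made up for illustration:

```python
import math

def binary_metrics(tp, fp, tn, fn):
    """Common performance measures for a binary predictor,
    computed from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)            # also called recall or TPR
    specificity = tn / (tn + fp)            # also called TNR
    precision = tp / (tp + fp)              # also called PPV
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    mcc = ((tp * tn - fp * fn) /
           math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)))
    return {"sensitivity": sensitivity, "specificity": specificity,
            "precision": precision, "accuracy": accuracy,
            "f1": f1, "mcc": mcc}

# Illustrative confusion matrix: 100 test cases in total.
m = binary_metrics(tp=40, fp=10, tn=35, fn=15)
```

The same terms often appear under different names (recall vs. sensitivity, PPV vs. precision), which is exactly the nomenclature confusion the review addresses.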
Background: The coronavirus disease 2019 (COVID-19) has been spreading rapidly in China and more than 30 other countries over the last two months. COVID-19 has multiple characteristics distinct from other infectious diseases, including high infectivity during incubation, a time delay between the real dynamics and the daily observed number of confirmed cases, and the intervention effects of implemented quarantine and control measures.
Methods: We develop a Susceptible, Un-quarantined infected, Quarantined infected, Confirmed infected (SUQC) model to characterize the dynamics of COVID-19 and explicitly parameterize the intervention effects of control measures, which makes it more suitable for this analysis than existing epidemic models.
Results: The SUQC model is applied to the daily released data of the confirmed infections to analyze the outbreak of COVID-19 in Wuhan, Hubei (excluding Wuhan), China (excluding Hubei) and four first-tier cities of China. We found that, before January 30, 2020, all these regions except Beijing had a reproductive number
Conclusions: We suggest that rigorous quarantine and control measures should be maintained until early March in Beijing, Shanghai, Guangzhou and Shenzhen, and until late March in Hubei. The model can also be useful for predicting the trend of the epidemic and providing quantitative guidance for other countries at high risk of outbreak, such as South Korea, Japan, Italy and Iran.
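The abstract does not spell out the model equations, but a compartmental model in the SUQC spirit can be sketched as below. The structure and all parameter values are illustrative assumptions, not the authors' fitted model:

```python
def suqc_step(s, u, q, c, n, alpha, gamma1, beta, dt=0.1):
    """One forward-Euler step of a minimal SUQC-style model.
    S: susceptible, U: un-quarantined infected, Q: quarantined infected,
    C: confirmed infected.  Only U transmits; quarantine moves U to Q at
    rate gamma1, and confirmation moves Q to C at rate beta."""
    new_infections = alpha * s * u / n
    ds = -new_infections
    du = new_infections - gamma1 * u
    dq = gamma1 * u - beta * q
    dc = beta * q
    return s + ds * dt, u + du * dt, q + dq * dt, c + dc * dt

# Simulate 60 days with illustrative (not fitted) parameters.
s, u, q, c = 11_000_000.0, 100.0, 0.0, 0.0
n = s + u
for _ in range(600):                       # 600 steps of dt = 0.1 day
    s, u, q, c = suqc_step(s, u, q, c, n,
                           alpha=0.3, gamma1=0.15, beta=0.1)
```

Fitting the infection and quarantine rates to daily confirmed counts, as the authors do, would then yield region-specific reproductive numbers and intervention effects; note that C lags U, reflecting the confirmation delay highlighted in the abstract.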
Non-smooth or even abrupt state changes occur during many biological processes, e.g., cell differentiation, proliferation, or disease deterioration. Such dynamics generally signal the emergence of critical transition phenomena, which result in drastic changes of system states or eventually qualitative changes of phenotypes. Hence, it is of great importance to detect such transitions and further reveal their molecular mechanisms at the network level. Here, we review recent advances on dynamical network biomarkers (DNBs) as well as the related theoretical foundation; DNBs can identify not only early signals of critical transitions but also their leading networks, which drive the whole system to initiate such transitions. To demonstrate the effectiveness of this novel approach, examples of complex diseases are also provided to detect the pre-disease stage, for which traditional methods or biomarkers fail.
Background: The recent development of metagenomic sequencing makes it possible to massively sequence microbial genomes including viral genomes without the need for laboratory culture. Existing reference-based and gene homology-based methods are not efficient in identifying unknown viruses or short viral sequences from metagenomic data.
Methods: Here we developed a reference-free and alignment-free machine learning method, DeepVirFinder, for identifying viral sequences in metagenomic data using deep learning.
Results: Trained on sequences from viral RefSeq discovered before May 2015 and evaluated on those discovered after that date, DeepVirFinder outperformed the state-of-the-art method VirFinder at all contig lengths, achieving AUROC 0.93, 0.95, 0.97, and 0.98 for 300, 500, 1000, and 3000 bp sequences, respectively. Enlarging the training data with millions of additional purified viral sequences from metavirome samples further improved the accuracy for identifying under-represented virus groups. Applying DeepVirFinder to real human gut metagenomic samples, we identified 51,138 viral sequences belonging to 175 bins in patients with colorectal carcinoma (CRC). Ten bins were found to be associated with cancer status, suggesting that viruses may play important roles in CRC.
Conclusions: Powered by deep learning and high-throughput metagenomic sequencing data, DeepVirFinder significantly improves the accuracy of viral identification and will assist the study of viruses in the era of metagenomics.
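The AUROC values quoted above have a simple rank-based interpretation: the probability that a randomly chosen positive (viral) contig scores higher than a randomly chosen negative one, with ties counting half. A minimal sketch with made-up scores (not DeepVirFinder output):

```python
def auroc(scores_pos, scores_neg):
    """Empirical AUROC: fraction of positive/negative pairs in which the
    positive scores higher (ties count 0.5).  O(n*m) for clarity."""
    wins = 0.0
    for p in scores_pos:
        for n in scores_neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))

# Hypothetical classifier scores for viral (positive) and
# non-viral (negative) contigs.
a = auroc([0.9, 0.8, 0.7, 0.4], [0.6, 0.3, 0.2, 0.1])   # 15 of 16 pairs won
```

An AUROC of 0.5 corresponds to random ranking and 1.0 to perfect separation, which is why values of 0.93-0.98 at increasing contig lengths indicate strong discrimination.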
One goal of precision oncology is to re-classify cancer based on molecular features rather than tissue of origin. Integrative clustering of large-scale multi-omics data is an important route to molecule-based cancer classification. Data heterogeneity and the complexity of inter-omics variation are two major challenges for integrative clustering analysis. According to the strategies used to deal with these difficulties, we summarize the clustering methods in three major categories: direct integrative clustering, clustering of clusters, and regulatory integrative clustering. A few practical considerations on data pre-processing, post-clustering analysis and pathway-based analysis are also discussed.
Experimental evidence and theoretical analyses have amply suggested that genetic information is very important in cancer genesis and progression, but is not the whole story. Nevertheless, "cancer as a disease of the genome" is still the dominant doctrine. Against this background, and based on the fundamental properties of biological systems, we recently proposed a new endogenous molecular-cellular network theory for cancer. Similar proposals have also been made by others. The new theory attempts to incorporate both genetic and environmental effects into one single framework, with the possibility of giving a quantitative and dynamical description. It asserts that the complex regulatory machinery behind biological processes may be modeled by a nonlinear stochastic dynamical system similar to a noise-perturbed Morse-Smale system, from which both qualitative and quantitative descriptions may be obtained. The dynamical variables are specified by a set of endogenous molecular-cellular agents, and the structure of the dynamical system by the interactions among those biological agents. Here we review this theory from a pedagogical angle that emphasizes the roles of modularization, hierarchy and autonomous regulation. We discuss how the core set of assumptions is exemplified in detail in one of the simplest, most important and best-studied model organisms, phage lambda. With this concrete and quantitative example in hand, we show that applying the hypothesized theory to human cancer, such as hepatocellular carcinoma (HCC), is plausible, and that it may provide a set of new insights on understanding cancer genesis and progression, and on strategies for cancer prevention, cure, and care.
The species accumulation curve, or collector’s curve, of a population gives the expected number of observed species or distinct classes as a function of sampling effort. Species accumulation curves allow researchers to assess and compare diversity across populations or to evaluate the benefits of additional sampling. Traditional applications have focused on ecological populations but emerging large-scale applications, for example in DNA sequencing, are orders of magnitude larger and present new challenges. We developed a method to estimate accumulation curves for predicting the complexity of DNA sequencing libraries. This method uses rational function approximations to a classical non-parametric empirical Bayes estimator due to Good and Toulmin [Biometrika, 1956, 43, 45–63]. Here we demonstrate how the same approach can be highly effective in other large-scale applications involving biological data sets. These include estimating microbial species richness, immune repertoire size, and k-mer diversity for genome assembly applications. We show how the method can be modified to address populations containing an effectively infinite number of species where saturation cannot practically be attained. We also introduce a flexible suite of tools implemented as an R package that make these methods broadly accessible.
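The classical Good-Toulmin estimator underlying this method has a compact form: if n_j is the number of species observed exactly j times, the expected number of new species after extending sampling by a factor t is the alternating sum over j of (-1)^(j+1) t^j n_j. A minimal sketch with made-up frequency counts (the paper's rational-function approximation, which tames the series' divergence for t > 1, is not reproduced here):

```python
def good_toulmin(counts_of_counts, t):
    """Good-Toulmin estimate of the number of NEW species observed if
    sampling effort were extended by a factor t.

    counts_of_counts: dict mapping j -> number of species seen exactly
    j times in the initial sample.  The alternating series is reliable
    only for t <= 1; larger t requires the rational-function approach."""
    return sum((-1) ** (j + 1) * (t ** j) * n_j
               for j, n_j in sorted(counts_of_counts.items()))

# Illustrative frequency-of-frequencies data:
# 50 singletons, 20 doubletons, 10 tripletons, 5 seen four times.
freqs = {1: 50, 2: 20, 3: 10, 4: 5}
extra = good_toulmin(freqs, t=1.0)   # new species expected if effort doubles
```

The same frequency-of-frequencies summary is all that is needed whether the "species" are microbial taxa, immune receptor clonotypes, k-mers, or distinct molecules in a sequencing library.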
Background: Since their invention, next-generation RNA sequencing (RNA-seq) technologies have become a powerful tool for studying the presence and quantity of RNA molecules in biological samples and have revolutionized transcriptomic studies. The analysis of RNA-seq data at four different levels (samples, genes, transcripts, and exons) involves multiple statistical and computational questions, some of which remain challenging to date.
Results: We review RNA-seq analysis tools at the sample, gene, transcript, and exon levels from a statistical perspective. We also highlight the biological and statistical questions of greatest practical concern.
Conclusions: The development of statistical and computational methods for analyzing RNA-seq data has made significant advances in the past decade. However, methods developed to answer the same biological question often rely on diverse statistical models and exhibit different performance under different scenarios. This review discusses and compares multiple commonly used statistical models regarding their assumptions, in the hope of helping users select appropriate methods as needed, as well as assisting developers for future method development.
Understanding how chromosomes fold provides insights into transcriptional regulation and hence the functional state of the cell. Using next-generation sequencing technology, the recently developed Hi-C approach enables a global view of the spatial organization of chromatin in the nucleus, which substantially expands our knowledge of genome organization and function. However, due to multiple layers of bias, noise and uncertainty buried in the Hi-C protocol, analyzing and interpreting Hi-C data pose great challenges and require novel statistical methods. This article provides an overview of recent Hi-C studies and their impact on biomedical research, describes major challenges in the statistical analysis of Hi-C data, and discusses some perspectives for future research.
The rapid technological developments following the Human Genome Project have made personalized genomes available. As the focus now shifts from characterizing genomes to making personalized disease associations, in combination with the availability of other omics technologies, the next big push will be not only to obtain a personalized genome but to quantitatively follow other omics. These include transcriptomes, proteomes, metabolomes, antibodyomes, and new emerging technologies, enabling the profiling of thousands of molecular components in individuals. Furthermore, omics profiling performed longitudinally can probe the temporal patterns associated with both molecular changes and the associated physiological states of health and disease. Such data necessitate the development of computational methodology not only to handle and descriptively assess the data, but also to construct quantitative biological models. Here we describe the availability of personal genomes and developing omics technologies that can be brought together for personalized implementations, and how these novel integrated approaches may effectively provide a precise personalized medicine that focuses not only on characterization and treatment but ultimately on the prevention of disease.
Much of our current knowledge of biology has been constructed based on population-average measurements. However, advances in single-cell analysis have demonstrated the omnipresent nature of cell-to-cell variability in any population. On one hand, tremendous efforts have been made to examine how such variability arises, how it is regulated by cellular networks, and how it can affect cell-fate decisions by single cells. On the other hand, recent studies suggest that the variability may carry valuable information that can facilitate the elucidation of underlying regulatory networks or the classification of cell states. To this end, a major challenge is determining what aspects of variability bear significant biological meaning. Addressing this challenge requires the development of new computational tools, in conjunction with appropriately chosen experimental platforms, to more effectively describe and interpret data on cell-cell variability. Here, we discuss examples of when population heterogeneity plays critical roles in determining biologically and clinically significant phenotypes, how it serves as a rich information source of regulatory mechanisms, and how we can extract such information to gain a deeper understanding of biological systems.
Fluctuating environments pose tremendous challenges to bacterial populations. In numerous bacterial species, individual cells have been observed to stochastically switch among multiple phenotypes, allowing the population to survive in rapidly changing environments. This kind of phenotypic heterogeneity with stochastic phenotype switching is generally understood to be an adaptive bet-hedging strategy. Mathematical models are essential to gain deeper insight into the principle behind bet-hedging and the patterns behind experimental data. Traditional deterministic models cannot correctly describe stochastic phenotype switching and bet-hedging, while traditional Markov chain models at the cellular level fail to explain their underlying molecular mechanisms. In this paper, we propose a nonlinear stochastic model of multistable bacterial systems at the molecular level. Our model not only provides a clear description of stochastic phenotype switching and bet-hedging within isogenic bacterial populations, but also provides deeper insight into the analysis of multidimensional experimental data. Moreover, we use deep mathematical theory to show that our stochastic model and traditional Markov chain models are essentially consistent, reflecting the dynamic behavior of the bacterial system at two different time scales. In addition, we provide a quantitative characterization of the critical state of multistable bacterial systems and develop an effective data-driven method to identify the critical state without resorting to specific mathematical models.
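At the cellular level, the two-phenotype stochastic switching described above can be sketched as a simple Markov chain. The switching probabilities below are illustrative, not taken from the paper:

```python
import random

def simulate_switching(p_ab, p_ba, steps, seed=0):
    """Simulate one cell stochastically switching between phenotypes A
    and B as a discrete-time Markov chain (a cell-level caricature of
    bet-hedging; the molecular-level model in the paper is richer).
    Returns the fraction of time spent in phenotype A."""
    rng = random.Random(seed)
    state = "A"
    time_in_a = 0
    for _ in range(steps):
        if state == "A":
            time_in_a += 1
            if rng.random() < p_ab:     # chance to switch A -> B
                state = "B"
        else:
            if rng.random() < p_ba:     # chance to switch B -> A
                state = "A"
    return time_in_a / steps

# Switching probabilities 0.01 (A->B) and 0.03 (B->A) per step.
frac_a = simulate_switching(p_ab=0.01, p_ba=0.03, steps=200_000)
```

In the long run the fraction of time in phenotype A approaches the stationary value p_ba/(p_ab + p_ba), here 0.75; this is the population composition that a bet-hedging strategy maintains under these rates, and the two time scales (fast molecular fluctuations vs. slow phenotype switching) are exactly what the paper's consistency result connects.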
Background: In the human genome, distal enhancers are involved in regulating target genes through proximal promoters by forming enhancer-promoter interactions. Although recently developed high-throughput experimental approaches have allowed us to recognize potential enhancer-promoter interactions genome-wide, it is still largely unclear to what extent the sequence-level information encoded in our genome helps guide such interactions.
Methods: Here we report a new computational method (named “SPEID”) using deep learning models to predict enhancer-promoter interactions based on sequence-based features only, when the locations of putative enhancers and promoters in a particular cell type are given.
Results: Our results across six different cell types demonstrate that SPEID is effective in predicting enhancer-promoter interactions as compared to state-of-the-art methods that only use information from a single cell type. As a proof-of-principle, we also applied SPEID to identify somatic non-coding mutations in melanoma samples that may have reduced enhancer-promoter interactions in tumor genomes.
Conclusions: This work demonstrates that deep learning models can help reveal that sequence-based features alone are sufficient to reliably predict enhancer-promoter interactions genome-wide.
In complex systems, the interplay between nonlinear and stochastic dynamics, e.g., J. Monod's necessity and chance, gives rise to an evolutionary process in the Darwinian sense: discrete jumps among attractors, with punctuated equilibria, spontaneous random "mutations" and "adaptations". On an evolutionary time scale, this produces sustainable diversity among individuals in a homogeneous population, rather than the convergence usually predicted by deterministic dynamics. The emergent discrete states in such a system, i.e., the attractors, have natural robustness against both internal and external perturbations. The phenotypic states of a biological cell, a mesoscopic nonlinear stochastic open biochemical system, can be understood through such a perspective.
Background: Because molecular docking can greatly improve efficiency and reduce research costs, it has become a key tool in computer-assisted drug design for predicting binding affinity and analyzing interaction modes.
Results: This study introduces the key principles, procedures and widely used applications for molecular docking. It also compares the commonly used docking applications and recommends the research areas for which each is suitable. Lastly, it briefly reviews the latest progress in molecular docking, such as integrated methods and deep learning.
Conclusion: Limited by incomplete molecular structures and the shortcomings of scoring functions, current docking applications are not accurate enough to predict binding affinity reliably. However, the current molecular docking technique could be improved by integrating large-scale biological data into the scoring function.
Background: Single-cell RNA sequencing (scRNA-seq) is an emerging technology that enables high-resolution detection of heterogeneity between cells. One important application of scRNA-seq data is to detect differential expression (DE) of genes. Currently, some researchers still apply DE analysis methods developed for bulk RNA-seq data to single-cell data, while new methods tailored to scRNA-seq data have also been developed. Bulk and single-cell RNA-seq data have different characteristics, so a systematic evaluation of the two types of methods on scRNA-seq data is needed.
Results: In this study, we conducted a series of experiments on scRNA-seq data to quantitatively evaluate 14 popular DE analysis methods, including both traditional methods developed for bulk RNA-seq data and new methods specifically designed for scRNA-seq data. We obtained observations and recommendations for the methods under different situations.
Conclusions: DE analysis methods for scRNA-seq data should be chosen with great caution, with regard to the characteristics of the data at hand. Different strategies should be taken for data with different sample sizes and/or different strengths of the expected signals. Several methods for scRNA-seq data show advantages in some aspects, and DEGSeq tends to outperform other methods with respect to the consistency, reproducibility and accuracy of predictions on scRNA-seq data.
Results: We review current practices in analysis of structure profiling data with emphasis on comparative and integrative analysis as well as highlight emerging questions. Comparative analysis has revealed structural patterns across transcriptomes and has become an integral component of recent profiling studies. Additionally, profiling data can be integrated into traditional structure prediction algorithms to improve prediction accuracy.
Conclusions: To keep pace with experimental developments, methods to facilitate, enhance and refine such analyses are needed. Parallel advances in analysis methodology will complement profiling technologies and help them reach their full potential.
Deep learning is making major breakthroughs in several areas of bioinformatics. Anticipating that this will soon occur for single-cell RNA-seq data analysis, we review newly published deep learning methods that help tackle its computational challenges. Autoencoders are found to be the dominant approach; however, methods based on deep generative models such as generative adversarial networks (GANs) are also emerging in this area.
Background: The therapeutic potential of bacteriophages has been debated since their first isolation and characterisation in the early 20th century. However, a lack of consistency in application and observed efficacy during their early use meant that, upon the discovery of antibiotic compounds, research in the field of phage therapy quickly slowed. The rise of antibiotic resistance in bacteria and improvements in our ability to modify and manipulate DNA, especially in the context of small viral genomes, have led to a recent resurgence of interest in utilising phages as antimicrobial therapeutics.
Results: This article introduces and discusses a number of results from the literature that aim to address key issues regarding the utility and efficacy of phages as antimicrobial therapeutics using molecular biology and synthetic biology approaches, giving a general view of recent progress in the field.
Conclusions: Advances in molecular biology and synthetic biology have enabled rapid progress in the field of phage engineering, with this article highlighting a number of promising strategies developed to optimise phages for the treatment of bacterial disease. Whilst many of the same issues that have historically limited the use of phages as therapeutics still exist, these modifications, or combinations thereof, may form a basis upon which future advances can be built. A focus on rigorous in vivo testing and investment in clinical trials for promising candidate phages may be required for the field to truly mature, but there is renewed hope that the potential benefits of phage therapy may finally be realised.
Imaging genetics is an emerging field aimed at identifying and characterizing genetic variants that influence measures derived from anatomical or functional brain images, which are in turn related to brain-related illnesses or fundamental cognitive, emotional and behavioral processes, and are affected by environmental factors. Here we review the recent evolution of statistical approaches and outstanding challenges in imaging genetics, with a focus on population-based imaging genetic association studies. We show the trend in imaging genetics from candidate approaches to pure discovery science, and from univariate to multivariate analyses. We also discuss future directions and prospects of imaging genetics for ultimately helping understand the genetic and environmental underpinnings of various neuropsychiatric disorders and turning basic science into clinical strategies.
Background: Most eukaryotic protein-coding genes exhibit alternative cleavage and polyadenylation (APA), resulting in mRNA isoforms with different 3′ untranslated regions (3′ UTRs). Studies have shown that brain cells tend to express long 3′ UTR isoforms using distal cleavage and polyadenylation sites (PASs).
Methods: Using our recently developed, comprehensive PAS database PolyA_DB, we developed an efficient method to examine APA, named Significance Analysis of Alternative Polyadenylation using RNA-seq (SAAP-RS). We applied this method to study APA in brain cells and neurogenesis.
Results: We found that neurons globally express longer 3′ UTRs than other cell types in brain, and microglia and endothelial cells express substantially shorter 3′ UTRs. We show that the 3′ UTR diversity across brain cells can be corroborated with single cell sequencing data. Further analysis of APA regulation of 3′ UTRs during differentiation of embryonic stem cells into neurons indicates that a large fraction of the APA events regulated in neurogenesis are similarly modulated in myogenesis, but to a much greater extent.
Conclusion: Together, our data delineate APA profiles in different brain cells and indicate that APA regulation in neurogenesis is largely an augmented version of the process taking place in other types of cell differentiation.
In this paper we present NPEST, a novel tool for analyzing expressed sequence tag (EST) distributions and predicting transcription start sites (TSSs). The method estimates the unknown probability distribution of ESTs using a maximum likelihood (ML) approach, which is then used to predict TSS positions. Accurate identification of TSSs is an important genomics task, since the position of regulatory elements relative to the TSS can have large effects on gene regulation, and the performance of promoter motif-finding methods depends on correct TSS identification. Our probabilistic approach extends recognition capabilities to multiple TSSs per locus, which may be a useful tool to enhance the understanding of alternative splicing mechanisms. This paper presents analysis of simulated data as well as statistical analysis of promoter regions of a model dicot plant
Cis-acting regulatory elements, e.g., promoters and ribosome binding sites (RBSs) with various desired properties, are building blocks widely used in synthetic biology for fine-tuning gene expression. In the last decade, methods for obtaining a controllable regulatory element from a random library have been established and applied to control protein expression and metabolic flux in different chassis cells. However, more rational strategies are still urgently needed to improve efficiency and to reduce laborious screening and multifaceted characterization. Precise computational models that can predict the activity of regulatory elements and quantitatively design elements with a desired strength have demonstrated tremendous potential. Here, recent progress on the construction of cis-acting regulatory element libraries and on quantitative predictive models for designing such elements is reviewed and discussed in detail.
Background: The increase in global population, climate change and the stagnation of crop yield per unit land area in recent decades urgently call for a new approach to support contemporary crop improvement. ePlant is a mathematical model of plant growth and development with a high level of mechanistic detail to meet this challenge.
Results: ePlant integrates modules developed for processes occurring at drastically different temporal (10⁻⁸–10⁶ seconds) and spatial (10⁻¹⁰–10 meters) scales, incorporating diverse physical, biophysical and biochemical processes including gene regulation, metabolic reactions, substrate transport and diffusion, energy absorption, transfer and conversion, organ morphogenesis, plant-environment interaction, etc. Individual modules are developed using a divide-and-conquer approach; modules at different temporal and spatial scales are integrated through transfer variables. We further propose a supervised learning procedure based on information geometry to combine model and data for both knowledge discovery and model extension or advances. We finally discuss the recent formation of a global consortium, which includes experts in plant biology, computer science, statistics, agronomy, phenomics, etc., aiming to expedite the development and application of ePlant or its equivalents by promoting a new model-development paradigm in which models are developed as a community effort rather than mainly by individual labs.
Conclusions: ePlant, as a major research tool to support quantitative and predictive plant science research, will play a crucial role in future model-guided crop engineering, breeding and agronomy.
Background: Self-sustained oscillations are a ubiquitous and vital phenomenon in living systems. From primitive single-celled bacteria to the most sophisticated organisms, periodicities have been observed in a broad spectrum of biological processes such as neuron firing, heartbeats, cell cycles, circadian rhythms, etc. Defects in these oscillators can cause diseases from insomnia to cancer. Elucidating their fundamental mechanisms is of great significance for understanding these diseases, yet challenging due to the complexity and diversity of the oscillators.
Results: Approaches in quantitative systems biology and synthetic biology have been most effective by simplifying the systems to contain only the most essential regulators. Here, we will review major progress that has been made in understanding biological oscillators using these approaches. The quantitative systems biology approach allows for identification of the essential components of an oscillator in an endogenous system. The synthetic biology approach makes use of the knowledge to design the simplest, de novo oscillators in both live cells and cell-free systems. These synthetic oscillators are tractable to further detailed analysis and manipulations.
Conclusion: With the recent development of biological and computational tools, both approaches have made significant achievements.
Chromatin immunoprecipitation coupled with massively parallel sequencing (ChIP-seq) is a powerful technology for identifying the genome-wide locations of DNA-binding proteins such as transcription factors or modified histones. As more and more experimental laboratories adopt ChIP-seq to unravel transcriptional and epigenetic regulatory mechanisms, computational analyses of ChIP-seq data are also becoming increasingly comprehensive and sophisticated. In this article, we review current computational methodology for ChIP-seq analysis, recommend useful algorithms and workflows, and introduce quality control measures at different analytical steps. We also discuss how ChIP-seq could be integrated with other types of genomic assays, such as gene expression profiling and genome-wide association studies, to provide a more comprehensive view of gene regulatory mechanisms in important physiological and pathological processes.