Background: The recent development of metagenomic sequencing makes it possible to massively sequence microbial genomes including viral genomes without the need for laboratory culture. Existing reference-based and gene homology-based methods are not efficient in identifying unknown viruses or short viral sequences from metagenomic data.
Methods: Here we developed a reference-free and alignment-free machine learning method, DeepVirFinder, for identifying viral sequences in metagenomic data using deep learning.
Results: Trained based on sequences from viral RefSeq discovered before May 2015, and evaluated on those discovered after that date, DeepVirFinder outperformed the state-of-the-art method VirFinder at all contig lengths, achieving AUROC 0.93, 0.95, 0.97, and 0.98 for 300, 500, 1000, and 3000 bp sequences respectively. Enlarging the training data with additional millions of purified viral sequences from metavirome samples further improved the accuracy for identifying virus groups that are under-represented. Applying DeepVirFinder to real human gut metagenomic samples, we identified 51,138 viral sequences belonging to 175 bins in patients with colorectal carcinoma (CRC). Ten bins were found associated with the cancer status, suggesting viruses may play important roles in CRC.
Conclusions: Powered by deep learning and high throughput sequencing metagenomic data, DeepVirFinder significantly improved the accuracy of viral identification and will assist the study of viruses in the era of metagenomics.
Background: Because molecular docking can greatly improve efficiency and reduce research costs, it has in recent years become a key tool in computer-assisted drug design for predicting binding affinity and analyzing interaction modes.
Results: This study introduces the key principles, procedures and the widely-used applications for molecular docking. Also, it compares the commonly used docking applications and recommends which research areas are suitable for them. Lastly, it briefly reviews the latest progress in molecular docking such as the integrated method and deep learning.
Conclusion: Limited by incomplete molecular structures and the shortcomings of scoring functions, current docking applications are not accurate enough to predict binding affinity. However, the current molecular docking technique could be improved by integrating big biological data into the scoring function.
Background: The coronavirus disease 2019 (COVID-19) has spread rapidly in China and more than 30 other countries over the last two months. COVID-19 has multiple characteristics distinct from other infectious diseases, including high infectivity during incubation, a time delay between the real dynamics and the daily observed number of confirmed cases, and the intervention effects of implemented quarantine and control measures.
Methods: We develop a Susceptible, Un-quarantined infected, Quarantined infected, Confirmed infected (SUQC) model to characterize the dynamics of COVID-19 and explicitly parameterize the intervention effects of control measures, which is more suitable for analysis than other existing epidemic models.
Results: The SUQC model is applied to the daily released data of the confirmed infections to analyze the outbreak of COVID-19 in Wuhan, Hubei (excluding Wuhan), China (excluding Hubei) and four first-tier cities of China. We found that, before January 30, 2020, all these regions except Beijing had a reproductive number
Conclusions: We suggest that rigorous quarantine and control measures should be kept before early March in Beijing, Shanghai, Guangzhou and Shenzhen, and before late March in Hubei. The model can also be useful to predict the trend of epidemic and provide quantitative guide for other countries at high risk of outbreak, such as South Korea, Japan, Italy and Iran.
The CRISPR-Cas9 system, naturally a defense mechanism in prokaryotes, has been repurposed as an RNA-guided DNA targeting platform. It has been widely used for genome editing and transcriptome modulation, and has shown great promise in correcting mutations in human genetic diseases. Off-target effects are a critical issue for all of these applications. Here we review the current status on the target specificity of the CRISPR-Cas9 system.
The specificity of protein-DNA interactions is most commonly modeled using position weight matrices (PWMs). First introduced in 1982, they have been adapted to many new types of data and many different approaches have been developed to determine the parameters of the PWM. New high-throughput technologies provide a large amount of data rapidly and offer an unprecedented opportunity to determine accurately the specificities of many transcription factors (TFs). But taking full advantage of the new data requires advanced algorithms that take into account the biophysical processes involved in generating the data. The new large datasets can also aid in determining when the PWM model is inadequate and must be extended to provide accurate predictions of binding sites. This article provides a general mathematical description of a PWM and how it is used to score potential binding sites, a brief history of the approaches that have been developed and the types of data that are used with an emphasis on algorithms that we have developed for analyzing high-throughput datasets from several new technologies. It also describes extensions that can be added when the simple PWM model is inadequate and further enhancements that may be necessary. It briefly describes some applications of PWMs in the discovery and modeling of
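To make the scoring step concrete, the following is a minimal sketch of how a PWM can be estimated from aligned binding sites and used to score candidate sites. The example sites, the pseudocount of 1, and the uniform background frequency of 0.25 are assumptions chosen for illustration; they are not taken from the article, and the article's own algorithms for high-throughput data are more involved.

```python
import math

BASES = "ACGT"

def build_pwm(sites, pseudocount=1.0, background=0.25):
    """Estimate a log2-odds PWM from equal-length aligned sites.

    The pseudocount and uniform background are illustrative assumptions.
    Returns one dict per position mapping base -> log2-odds score.
    """
    length = len(sites[0])
    pwm = []
    for i in range(length):
        counts = {b: pseudocount for b in BASES}
        for site in sites:
            counts[site[i]] += 1
        total = sum(counts.values())
        pwm.append({b: math.log2((counts[b] / total) / background) for b in BASES})
    return pwm

def score_site(pwm, site):
    """Score a site as the sum of independent per-position contributions."""
    return sum(column[base] for column, base in zip(pwm, site))

def best_site(pwm, sequence):
    """Slide the PWM along a sequence; return (best score, start index)."""
    width = len(pwm)
    scored = [
        (score_site(pwm, sequence[i:i + width]), i)
        for i in range(len(sequence) - width + 1)
    ]
    return max(scored)

# Hypothetical aligned sites, used only to exercise the functions.
example_sites = ["TATAAT", "TATGAT", "TACAAT", "TATACT"]
pwm = build_pwm(example_sites)
```

The additive form of `score_site` is what encodes the PWM's assumption that positions contribute independently; the extensions mentioned in the text relax exactly that assumption.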
Background: Many existing bioinformatics predictors are based on machine learning technology. When applying these predictors in practical studies, their predictive performances should be well understood. Different performance measures are applied in various studies as well as different evaluation methods. Even for the same performance measure, different terms, nomenclatures or notations may appear in different context.
Results: We carried out a review on the most commonly used performance measures and the evaluation methods for bioinformatics predictors.
Conclusions: It is important in bioinformatics to correctly understand and interpret the performance, as it is the key to rigorously compare performances of different predictors and to choose the right predictor.
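As a small illustration of the kind of measures such a review covers, the sketch below computes several standard binary-classification measures from a confusion matrix. The choice of measures and the function names are assumptions made here for illustration; the abstract does not list which measures the review discusses.

```python
import math

def _ratio(numerator, denominator):
    """Return numerator/denominator, or NaN when the denominator is zero."""
    return numerator / denominator if denominator else float("nan")

def confusion_counts(y_true, y_pred):
    """Count (tp, fp, tn, fn) for binary labels coded as 1 and 0."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, fp, tn, fn

def performance_measures(tp, fp, tn, fn):
    """Compute common measures; several go by different names in the literature."""
    mcc_denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "sensitivity": _ratio(tp, tp + fn),   # also called recall or true positive rate
        "specificity": _ratio(tn, tn + fp),   # also called true negative rate
        "precision": _ratio(tp, tp + fp),     # also called positive predictive value
        "accuracy": _ratio(tp + tn, tp + fp + tn + fn),
        "f1": _ratio(2 * tp, 2 * tp + fp + fn),
        "mcc": _ratio(tp * tn - fp * fn, mcc_denominator),
    }
```

The comments on alternative names reflect the nomenclature problem raised in the Background: the same quantity is often reported under different terms in different studies.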
Background: In the human genome, distal enhancers are involved in regulating target genes through proximal promoters by forming enhancer-promoter interactions. Although recently developed high-throughput experimental approaches have allowed us to recognize potential enhancer-promoter interactions genome-wide, it is still largely unclear to what extent the sequence-level information encoded in our genome helps guide such interactions.
Methods: Here we report a new computational method (named “SPEID”) using deep learning models to predict enhancer-promoter interactions based on sequence-based features only, when the locations of putative enhancers and promoters in a particular cell type are given.
Results: Our results across six different cell types demonstrate that SPEID is effective in predicting enhancer-promoter interactions as compared to state-of-the-art methods that only use information from a single cell type. As a proof-of-principle, we also applied SPEID to identify somatic non-coding mutations in melanoma samples that may have reduced enhancer-promoter interactions in tumor genomes.
Conclusions: This work demonstrates that deep learning models can help reveal that sequence-based features alone are sufficient to reliably predict enhancer-promoter interactions genome-wide.
Non-smooth or even abrupt state changes exist during many biological processes, e.g., cell differentiation processes, proliferation processes, or even disease deterioration processes. Such dynamics generally signal the emergence of critical transition phenomena, which result in drastic changes of system states or eventually qualitative changes of phenotypes. Hence, it is of great importance to detect such transitions and further reveal their molecular mechanisms at the network level. Here, we review the recent advances in dynamical network biomarkers (DNBs) as well as the related theoretical foundation, which can identify not only early signals of the critical transitions but also their leading networks, which drive the whole system to initiate such transitions. To demonstrate the effectiveness of this novel approach, examples of complex diseases are also provided in which the pre-disease stage is detected where traditional methods or biomarkers failed.
Background: Since their invention, next-generation RNA sequencing (RNA-seq) technologies have become a powerful tool to study the presence and quantity of RNA molecules in biological samples and have revolutionized transcriptomic studies. The analysis of RNA-seq data at four different levels (samples, genes, transcripts, and exons) involves multiple statistical and computational questions, some of which remain challenging to date.
Results: We review RNA-seq analysis tools at the sample, gene, transcript, and exon levels from a statistical perspective. We also highlight the biological and statistical questions of most practical considerations.
Conclusions: The development of statistical and computational methods for analyzing RNA-seq data has made significant advances in the past decade. However, methods developed to answer the same biological question often rely on diverse statistical models and exhibit different performance under different scenarios. This review discusses and compares multiple commonly used statistical models regarding their assumptions, in the hope of helping users select appropriate methods as needed, as well as assisting developers for future method development.
Background: Next-generation sequencing (NGS) technologies have fostered an unprecedented proliferation of high-throughput sequencing projects and a concomitant development of novel algorithms for the assembly of short reads. However, numerous technical or computational challenges in de novo assembly still remain, although many new ideas and solutions have been suggested to tackle the challenges in both experimental and computational settings.
Results: In this review, we first briefly introduce some of the major challenges faced by NGS sequence assembly. Then, we analyze the characteristics of various sequencing platforms and their impact on assembly results. After that, we classify de novo assemblers according to their frameworks (overlap graph-based, de Bruijn graph-based and string graph-based), and introduce the characteristics of each assembly tool and the scenarios to which it is suited. Next, we introduce in detail the solutions to the main challenges of de novo assembly of next-generation sequencing data, single-cell sequencing data and single-molecule sequencing (SMS) data. Finally, we discuss the application of SMS long reads in solving problems encountered in NGS assembly.
Conclusions: This review not only gives an overview of the latest methods and developments in assembly algorithms, but also provides guidelines to determine the optimal assembly algorithm for a given input sequencing data type.
One goal of precision oncology is to re-classify cancer based on molecular features rather than its tissue of origin. Integrative clustering of large-scale multi-omics data is an important approach to molecule-based cancer classification. Data heterogeneity and the complexity of inter-omics variations are two major challenges for integrative clustering analysis. According to the different strategies used to deal with these difficulties, we group the clustering methods into three major categories: direct integrative clustering, clustering of clusters and regulatory integrative clustering. A few practical considerations on data pre-processing, post-clustering analysis and pathway-based analysis are also discussed.
Background: Random Forests is a popular classification and regression method that has proven powerful for various prediction problems in biological studies. However, its performance often deteriorates when the number of features increases. To address this limitation, feature elimination Random Forests was proposed, which uses only the features with the largest variable importance scores. Yet the performance of this method is unsatisfactory, possibly due to its rigid feature selection and the increased correlation between the trees of the forest.
Methods: We propose variable importance-weighted Random Forests, which, instead of sampling features with equal probability at each node to build up trees, samples features according to their variable importance scores, and then selects the best split from the randomly selected features.
Results: We evaluate the performance of our method through comprehensive simulation and real data analyses, for both regression and classification. Compared to the standard Random Forests and the feature elimination Random Forests methods, our proposed method has improved performance in most cases.
Conclusions: By incorporating the variable importance scores into the random feature selection step, our method can better utilize more informative features without completely ignoring less informative ones, and hence has improved prediction accuracy in the presence of weak signals and large noise. We have implemented an R package “viRandomForests” based on the original R package “randomForest” and it can be freely downloaded from http://zhaocenter.org/software.
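The feature sampling step described in the Methods can be sketched as follows. This is only an illustration of importance-weighted sampling without replacement at a single node; it is not the viRandomForests implementation, and the function name, the small weight floor for non-positive importance scores, and the example scores are all assumptions made here.

```python
import random

def sample_features_weighted(importances, mtry, rng, floor=1e-6):
    """Draw up to mtry distinct feature indices, with probability
    proportional to variable importance.

    Non-positive scores are raised to a small floor (an assumption of this
    sketch) so that weak features remain selectable rather than excluded.
    """
    candidates = list(range(len(importances)))
    weights = [max(score, floor) for score in importances]
    chosen = []
    for _ in range(min(mtry, len(candidates))):
        # Pick one remaining feature, then remove it so it cannot repeat.
        pick = rng.choices(range(len(candidates)), weights=weights, k=1)[0]
        chosen.append(candidates.pop(pick))
        weights.pop(pick)
    return chosen

# Hypothetical importance scores: one strong feature and three weak ones.
rng = random.Random(0)
example_importances = [10.0, 0.1, 0.1, 0.1]
node_features = sample_features_weighted(example_importances, mtry=2, rng=rng)
```

In a full tree-growing routine the best split would then be searched only among `node_features`. Standard Random Forests corresponds to equal weights, and feature elimination corresponds to setting the weights of discarded features to exactly zero.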
Results: We review current practices in analysis of structure profiling data with emphasis on comparative and integrative analysis as well as highlight emerging questions. Comparative analysis has revealed structural patterns across transcriptomes and has become an integral component of recent profiling studies. Additionally, profiling data can be integrated into traditional structure prediction algorithms to improve prediction accuracy.
Conclusions: To keep pace with experimental developments, methods to facilitate, enhance and refine such analyses are needed. Parallel advances in analysis methodology will complement profiling technologies and help them reach their full potential.
The species accumulation curve, or collector’s curve, of a population gives the expected number of observed species or distinct classes as a function of sampling effort. Species accumulation curves allow researchers to assess and compare diversity across populations or to evaluate the benefits of additional sampling. Traditional applications have focused on ecological populations but emerging large-scale applications, for example in DNA sequencing, are orders of magnitude larger and present new challenges. We developed a method to estimate accumulation curves for predicting the complexity of DNA sequencing libraries. This method uses rational function approximations to a classical non-parametric empirical Bayes estimator due to Good and Toulmin [Biometrika, 1956, 43, 45–63]. Here we demonstrate how the same approach can be highly effective in other large-scale applications involving biological data sets. These include estimating microbial species richness, immune repertoire size, and k-mer diversity for genome assembly applications. We show how the method can be modified to address populations containing an effectively infinite number of species where saturation cannot practically be attained. We also introduce a flexible suite of tools implemented as an R package that make these methods broadly accessible.
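For orientation, the classical Good and Toulmin estimator that the method builds on can be sketched in a few lines. If n_j denotes the number of classes observed exactly j times, the expected number of new classes found when the sampling effort is extended by a fraction t of the original effort is estimated by the alternating series sum over j of (-1)^(j+1) t^j n_j. The sketch below implements only this classical series, not the rational function approximation described above; the function names and toy data are assumptions for illustration.

```python
from collections import Counter

def counts_histogram(observations):
    """Return {j: n_j}, where n_j is the number of classes seen exactly j times."""
    per_class = Counter(observations)
    return Counter(per_class.values())

def good_toulmin_new_classes(hist, t):
    """Classical Good-Toulmin estimate of new classes after extending the
    sampling effort by a fraction t of the original effort.

    The alternating power series is generally unstable for t > 1, which is
    the regime the rational function approximations are meant to handle.
    """
    return sum((-1) ** (j + 1) * (t ** j) * n_j for j, n_j in hist.items())

# Toy sample: a seen 3 times, b and d twice, c and e once.
toy_hist = counts_histogram(list("aaabbcdde"))
doubled_effort_estimate = good_toulmin_new_classes(toy_hist, 1.0)
```

With real sequencing libraries the histogram would be built from read or molecule counts, and extrapolating far beyond the original depth is precisely where the plain series stops being usable.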
Background: The therapeutic potential of bacteriophages has been debated since their first isolation and characterisation in the early 20th century. However, a lack of consistency in application and observed efficacy during their early use meant that upon the discovery of antibiotic compounds research in the field of phage therapy quickly slowed. The rise of antibiotic resistance in bacteria and improvements in our abilities to modify and manipulate DNA, especially in the context of small viral genomes, has led to a recent resurgence of interest in utilising phage as antimicrobial therapeutics.
Results: This article introduces and discusses a number of results from the literature that use molecular biology and synthetic biology approaches to address key issues regarding the utility and efficacy of phage as antimicrobial therapeutics, giving a general view of recent progress in the field.
Conclusions: Advances in molecular biology and synthetic biology have enabled rapid progress in the field of phage engineering, with this article highlighting a number of promising strategies developed to optimise phages for the treatment of bacterial disease. Whilst many of the same issues that have historically limited the use of phages as therapeutics still exist, these modifications, or combinations thereof, may form a basis upon which future advances can be built. A focus on rigorous in vivo testing and investment in clinical trials for promising candidate phages may be required for the field to truly mature, but there is renewed hope that the potential benefits of phage therapy may finally be realised.
Background: Most eukaryotic protein-coding genes exhibit alternative cleavage and polyadenylation (APA), resulting in mRNA isoforms with different 3′ untranslated regions (3′ UTRs). Studies have shown that brain cells tend to express long 3′ UTR isoforms using distal cleavage and polyadenylation sites (PASs).
Methods: Using our recently developed, comprehensive PAS database PolyA_DB, we developed an efficient method to examine APA, named Significance Analysis of Alternative Polyadenylation using RNA-seq (SAAP-RS). We applied this method to study APA in brain cells and neurogenesis.
Results: We found that neurons globally express longer 3′ UTRs than other cell types in brain, and microglia and endothelial cells express substantially shorter 3′ UTRs. We show that the 3′ UTR diversity across brain cells can be corroborated with single cell sequencing data. Further analysis of APA regulation of 3′ UTRs during differentiation of embryonic stem cells into neurons indicates that a large fraction of the APA events regulated in neurogenesis are similarly modulated in myogenesis, but to a much greater extent.
Conclusion: Together, our data delineate APA profiles in different brain cells and indicate that APA regulation in neurogenesis is largely an augmented process taking place in other types of cell differentiation.
Background: De novo genome assembly relies on two kinds of graphs: de Bruijn graphs and overlap graphs. Overlap graphs are the basis for the Celera assembler, while de Bruijn graphs have become the dominant technical device in the last decade. Those two kinds of graphs are collectively called assembly graphs.
Results: In this review, we discuss the most recent advances in the problem of constructing, representing and navigating assembly graphs, focusing on very large datasets. We will also explore some computational techniques, such as the Bloom filter, to compactly store graphs while keeping all functionalities intact.
Conclusions: We complete our analysis with a discussion on the algorithmic issues of assembling from long reads (e.g., PacBio and Oxford Nanopore). Finally, we present some of the most relevant open problems in this field.
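The idea of storing a de Bruijn graph compactly in a Bloom filter can be sketched as follows: the k-mers (nodes) are inserted into the filter, and edges are recovered implicitly by querying the four possible successors of a k-mer. This is a minimal illustration rather than any specific tool's data structure; the class, the SHA-256-based hashing, and the filter size are assumptions made here.

```python
import hashlib

class BloomFilter:
    """Bit-array set with no false negatives and a tunable false-positive rate."""

    def __init__(self, n_bits, n_hashes):
        self.n_bits = n_bits
        self.n_hashes = n_hashes
        self.bits = bytearray((n_bits + 7) // 8)

    def _positions(self, item):
        # Derive n_hashes bit positions by salting one cryptographic hash.
        for seed in range(self.n_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.n_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item)
        )

def kmers(sequence, k):
    """Yield every length-k substring of the sequence."""
    return (sequence[i:i + k] for i in range(len(sequence) - k + 1))

def successors(bloom, kmer):
    """Navigate the implicit graph: test the four possible next k-mers."""
    return [kmer[1:] + base for base in "ACGT" if (kmer[1:] + base) in bloom]

# Hypothetical read and parameters, for illustration only.
read, k = "ACGTACGGA", 4
graph = BloomFilter(n_bits=1024, n_hashes=3)
for kmer in kmers(read, k):
    graph.add(kmer)
```

Because a Bloom filter can report k-mers that were never inserted, `successors` may occasionally return spurious neighbors; practical designs add a mechanism to detect or remove such false positives during traversal.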
Background: Single-cell RNA-sequencing (scRNA-seq) is a rapidly evolving technology that enables measurement of gene expression levels at an unprecedented resolution. Despite the explosive growth in the number of cells that can be assayed by a single experiment, scRNA-seq still has several limitations, including high rates of dropouts, which result in a large number of genes having zero read count in the scRNA-seq data, and complicate downstream analyses.
Methods: To overcome this problem, we treat zeros as missing values and develop nonparametric deep learning methods for imputation. Specifically, our LATE (Learning with AuToEncoder) method trains an autoencoder with random initial values of the parameters, whereas our TRANSLATE (TRANSfer learning with LATE) method further allows for the use of a reference gene expression data set to provide LATE with an initial set of parameter estimates.
Results: On both simulated and real data, LATE and TRANSLATE outperform existing scRNA-seq imputation methods, achieving lower mean squared error in most cases, recovering nonlinear gene-gene relationships, and better separating cell types. They are also highly scalable and can efficiently process over 1 million cells in just a few hours on a GPU.
Conclusions: We demonstrate that our nonparametric approach to imputation based on autoencoders is powerful and highly efficient.
Experimental evidence and theoretical analyses have amply suggested that in cancer genesis and progression genetic information is very important but not the whole story. Nevertheless, “cancer as a disease of the genome” is still currently the dominant doctrine. With such a background and based on the fundamental properties of biological systems, a new endogenous molecular-cellular network theory for cancer was recently proposed by us. Similar proposals were also made by others. The new theory attempts to incorporate both genetic and environmental effects into one single framework, with the possibility to give a quantitative and dynamical description. It is asserted that the complex regulatory machinery behind biological processes may be modeled by a nonlinear stochastic dynamical system similar to a noise perturbed Morse-Smale system. Both qualitative and quantitative descriptions may be obtained. The dynamical variables are specified by a set of endogenous molecular-cellular agents and the structure of the dynamical system by the interactions among those biological agents. Here we review this theory from a pedagogical angle which emphasizes the role of modularization, hierarchy and autonomous regulation. We discuss how the core set of assumptions is exemplified in detail in one of the simple, important and well studied model organisms, Phage lambda. With this concrete and quantitative example in hand, we show that the application of the hypothesized theory in human cancer, such as hepatocellular carcinoma (HCC), is plausible, and that it may provide a set of new insights on understanding cancer genesis and progression, and on strategies for cancer prevention, cure, and care.
The rapid technological developments following the Human Genome Project have made possible the availability of personalized genomes. As the focus now shifts from characterizing genomes to making personalized disease associations, in combination with the availability of other omics technologies, the next big push will be not only to obtain a personalized genome, but to quantitatively follow other omics. This will include transcriptomes, proteomes, metabolomes, antibodyomes, and new emerging technologies, enabling the profiling of thousands of molecular components in individuals. Furthermore, omics profiling performed longitudinally can probe the temporal patterns associated with both molecular changes and associated physiological health and disease states. Such data necessitates the development of computational methodology to not only handle and descriptively assess such data, but also construct quantitative biological models. Here we describe the availability of personal genomes and developing omics technologies that can be brought together for personalized implementations and how these novel integrated approaches may effectively provide a precise personalized medicine that focuses on not only characterization and treatment but ultimately the prevention of disease.
Understanding how chromosomes fold provides insights into the transcription regulation, hence, the functional state of the cell. Using the next generation sequencing technology, the recently developed Hi-C approach enables a global view of spatial chromatin organization in the nucleus, which substantially expands our knowledge about genome organization and function. However, due to multiple layers of biases, noises and uncertainties buried in the protocol of Hi-C experiments, analyzing and interpreting Hi-C data poses great challenges, and requires novel statistical methods to be developed. This article provides an overview of recent Hi-C studies and their impacts on biomedical research, describes major challenges in statistical analysis of Hi-C data, and discusses some perspectives for future research.
Background: Self-sustained oscillations are a ubiquitous and vital phenomenon in living systems. From primitive single-cellular bacteria to the most sophisticated organisms, periodicities have been observed in a broad spectrum of biological processes such as neuron firing, heart beats, cell cycles, circadian rhythms, etc. Defects in these oscillators can cause diseases from insomnia to cancer. Elucidating their fundamental mechanisms is of great significance to diseases, and yet challenging, due to the complexity and diversity of these oscillators.
Results: Approaches in quantitative systems biology and synthetic biology have been most effective by simplifying the systems to contain only the most essential regulators. Here, we will review major progress that has been made in understanding biological oscillators using these approaches. The quantitative systems biology approach allows for identification of the essential components of an oscillator in an endogenous system. The synthetic biology approach makes use of the knowledge to design the simplest, de novo oscillators in both live cells and cell-free systems. These synthetic oscillators are tractable to further detailed analysis and manipulations.
Conclusion: With the recent development of biological and computational tools, both approaches have made significant achievements.
Background: Mendelian randomization (MR) analysis has become popular in inferring and estimating the causality of an exposure on an outcome due to the success of genome-wide association studies. Many statistical approaches have been developed, and each of these methods requires specific assumptions.
Results: In this article, we review the pros and cons of these methods. We use an example of high-density lipoprotein cholesterol on coronary artery disease to illuminate the challenges in Mendelian randomization investigation.
Conclusion: The currently available MR approaches allow us to study causality among risk factors and outcomes. However, novel approaches are desirable for overcoming multiple-source confounding of risk factors and an outcome in MR analysis.
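As one concrete example of an MR estimator, the sketch below computes the per-variant Wald ratio and the fixed-effect inverse-variance weighted (IVW) estimate from summary statistics. IVW is a widely used approach but is not named in this abstract, so its use here is an assumption for illustration, and the summary statistics are hypothetical numbers rather than real HDL cholesterol or coronary artery disease data.

```python
import math

def wald_ratio(beta_exposure, beta_outcome):
    """Causal effect estimate from a single genetic variant."""
    return beta_outcome / beta_exposure

def ivw_estimate(beta_exposure, beta_outcome, se_outcome):
    """Fixed-effect IVW estimate and standard error across variants.

    Assumes every variant is a valid instrument; pleiotropic variants
    violate this assumption and can bias the estimate.
    """
    numerator = sum(
        bx * by / se ** 2
        for bx, by, se in zip(beta_exposure, beta_outcome, se_outcome)
    )
    denominator = sum(
        bx ** 2 / se ** 2 for bx, se in zip(beta_exposure, se_outcome)
    )
    return numerator / denominator, math.sqrt(1.0 / denominator)

# Hypothetical variant effects on the exposure and on the outcome.
bx = [0.10, 0.20, 0.15]
by = [-0.02, -0.05, -0.03]
se_y = [0.010, 0.020, 0.015]
estimate, standard_error = ivw_estimate(bx, by, se_y)
```

The estimate is a weighted average of the per-variant Wald ratios, which is why a single invalid instrument with a large weight can dominate the result.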
Background: Traditional Chinese medicine (TCM) treats diseases in a holistic manner, while TCM formulae are multi-component, multi-target agents at the molecular level. Thus there are many parallels between the key ideas of TCM pharmacology and network pharmacology. In recent years, TCM network pharmacology has developed as an interdisciplinary field combining TCM science and network pharmacology, which studies the mechanism of TCM at the molecular level and in the context of biological networks. It provides a new research paradigm that can use modern biomedical science to interpret the mechanism of TCM, and it promises to accelerate the modernization and internationalization of TCM.
Results: In this paper we introduce state-of-the-art free data sources, web servers and software that can be used in TCM network pharmacology, including databases of TCM, drug targets and diseases, web servers for the prediction of drug targets, and tools for network and functional analysis.
Conclusions: This review could help experimental pharmacologists make better use of the existing data and methods in their study of TCM.
Background: Cellular non-coding RNAs are extensively modified post-transcriptionally, with more than 100 chemically distinct nucleotides identified to date. In the past five years, new sequencing based methods have revealed widespread decoration of eukaryotic messenger RNA with diverse RNA modifications whose functions in mRNA metabolism are only beginning to be known.
Results: Since most of the identified mRNA modifying enzymes are present in the nucleus, these modifications have the potential to function in nuclear pre-mRNA processing including alternative splicing. Here we review recent progress towards illuminating the role of pre-mRNA modifications in splicing and highlight key areas for future investigation in this rapidly growing field.
Conclusions: Future studies to identify which modifications are added to nascent pre-mRNA and to interrogate the direct effects of individual modifications are likely to reveal new mechanisms by which nuclear pre-mRNA processing is regulated.
Much of our current knowledge of biology has been constructed based on population-average measurements. However, advances in single-cell analysis have demonstrated the omnipresent nature of cell-to-cell variability in any population. On one hand, tremendous efforts have been made to examine how such variability arises, how it is regulated by cellular networks, and how it can affect cell-fate decisions by single cells. On the other hand, recent studies suggest that the variability may carry valuable information that can facilitate the elucidation of underlying regulatory networks or the classification of cell states. To this end, a major challenge is determining what aspects of variability bear significant biological meaning. Addressing this challenge requires the development of new computational tools, in conjunction with appropriately chosen experimental platforms, to more effectively describe and interpret data on cell-cell variability. Here, we discuss examples of when population heterogeneity plays critical roles in determining biologically and clinically significant phenotypes, how it serves as a rich information source of regulatory mechanisms, and how we can extract such information to gain a deeper understanding of biological systems.
Deep learning is making major breakthroughs in several areas of bioinformatics. Anticipating that this will soon occur for single-cell RNA-seq data analysis, we review newly published deep learning methods that help tackle computational challenges. Autoencoders are found to be the dominant approach. However, methods based on deep generative models such as generative adversarial networks (GANs) are also emerging in this area.