Cover illustration
Histone modifications play an important role in defining chromatin states and regulating gene expression. Many histone modifications, such as H3K27me3 and H3K9me3, form broad domains in the genome. Measured by ChIP-seq, broad histone domains look like mountain ridges in the data and are more difficult to identify than sharp peaks. In this issue, Zang et al. present a computational method, RECOGNICER, for identifying cross-scale broad domains from ChIP-seq data using a coarse-...
Background: The extremely small amount of DNA in a cell makes it difficult to study the whole genome of single cells, so whole-genome amplification (WGA) is necessary to increase the DNA amount and enable downstream analyses. Multiple displacement amplification (MDA) is the most widely used WGA technique.
Results: Compared with PCR-based and other amplification methods, MDA yields high-quality DNA products and better genome coverage by using phi29 DNA polymerase. Moreover, recently developed MDA technologies such as microreactor MDA, emulsion MDA, and micro-channel MDA have improved amplification uniformity. Additionally, novel methods such as TruePrime WGA allow amplification without primers.
Conclusion: Here, we reviewed a selection of recently developed MDA methods, their advantages over other WGA methods, and improved MDA-based technologies, followed by a discussion of future perspectives. As MDA continues to develop and detection technologies are updated, MDA will be applied in a growing number of fields and provide a solid foundation for scientific research.
Background: High-order chromatin structure has been shown to play a vital role in gene regulation. Previously we identified two types of sequence domains, CGI (CpG island) forest and CGI prairie, which tend to segregate spatially, but to different extents in different tissues. Here we aim to further quantify the association of domain segregation with gene regulation and thus with differentiation.
Methods: Using published RNA-seq and Hi-C data, we identified tissue-specific genes and quantitatively investigated how their regulation relates to chromatin structure. In addition, we constructed two types of gene networks and examined the association between gene-pair co-regulation and genome organization.
Results: We show that tissue-specific genes tend to be enriched in prairies relative to forests. Highly specific genes also tend to cluster according to their functions in a relatively small number of prairies. Furthermore, tissue-specific forest-prairie contact formation was associated with the regulation of tissue-specific genes, in particular those in the prairie domains, pointing to the important role of gene positioning, in the linear DNA sequence as well as in 3D chromatin structure, in gene regulatory network formation.
Conclusion: We investigated how gene regulation is related to genome organization from the perspective of forest-prairie spatial interactions. Unlike compartments A and B, forest and prairie are identified solely on the basis of sequence properties. The simple and uniform framework provided here (forest-prairie domain segregation) can therefore be used to further understand chromatin structure changes, as well as their underlying biological significance, in different stages such as tumorigenesis.
Background: Herpes simplex virus type 1 (HSV-1) is a ubiquitous infectious pathogen that widely affects human health. To decipher the complicated human-HSV-1 interactions, a comprehensive protein-protein interaction (PPI) network between human and HSV-1 is greatly needed.
Methods: To complement the experimental identification of human-HSV-1 PPIs, an integrative strategy to predict proteome-wide PPIs between human and HSV-1 was developed. For each human-HSV-1 protein pair, four popular PPI inference methods (interolog mapping, the domain-domain interaction-based method, the domain-motif interaction-based method, and the machine learning-based method) were optimally implemented to generate four interaction probability scores, which were further integrated into a final probability score.
Results: A comprehensive high-confidence PPI network between human and HSV-1 was established, covering 10,432 interactions between 4,546 human proteins and 72 HSV-1 proteins. Functional and network analyses of the HSV-1 targeting proteins in the context of the human interactome recapitulate existing knowledge of the HSV-1 replication cycle, supporting the overall reliability of the predicted PPI network. Considering that HSV-1 infections are implicated in encephalitis and neurodegenerative diseases, we focused on exploring the biological significance of the brain-specific human-HSV-1 PPIs. In particular, the predicted interactions between HSV-1 proteins and Alzheimer’s-disease-related proteins were intensively investigated.
Conclusion: The current work provides testable hypotheses to assist in the mechanistic understanding of the human-HSV-1 relationship and in the discovery of anti-HSV-1 pharmaceutical targets. To make the predicted PPI network and the datasets freely accessible to the scientific community, a user-friendly database browser was released at http://www.zzdlab.com/HintHSV/index.php.
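The Methods of this abstract describe combining four per-method interaction probabilities into one final score, but do not state the integration rule. As a minimal sketch only, the example below swaps in a noisy-OR combination, one common way to merge independent evidence sources; the function name `integrate_scores`, the dictionary keys, and the score values are all hypothetical and are not taken from the paper.

```python
from math import prod


def integrate_scores(scores):
    """Combine per-method interaction probabilities with a noisy-OR rule.

    ASSUMPTION: noisy-OR is used here for illustration only; the abstract
    does not say how the four scores were actually integrated.
    """
    for p in scores:
        if not 0.0 <= p <= 1.0:
            raise ValueError(f"probability out of range: {p}")
    # The pair is scored as interacting unless every method "misses" it.
    return 1.0 - prod(1.0 - p for p in scores)


# Hypothetical scores for one human-HSV-1 protein pair.
pair_scores = {
    "interolog": 0.5,
    "domain_domain": 0.2,
    "domain_motif": 0.0,
    "machine_learning": 0.5,
}
final_score = integrate_scores(pair_scores.values())
print(f"final probability: {final_score:.2f}")
```

Under noisy-OR, a method that returns 0 leaves the combined score unchanged, while any single confident method pulls the final score up. That behavior may or may not match the weighting used in the published network.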
Background: COVID-19 has severely and continuously affected the whole world since late December 2019. Rapidly increasing infections have drawn intense worldwide attention. Modeling the evolution of COVID-19 effectively and efficiently is of great significance for prevention and control.
Methods: We propose the multi-chain Fudan-CCDC model, based on the original single-chain model in [Shao et al. 2020], to describe the evolution of COVID-19 in Singapore. A multi-chain model can be regarded as the superposition of several single chains with different characteristics. We identify the model parameters by minimizing a penalty function.
Results: The numerical simulations show that the multi-chain model fits the data well. Although the daily increments are unsteady, they still fall within ±30% of the simulated values.
Conclusion: The multi-chain Fudan-CCDC model provides an effective way to detect imported infections and super spreaders early and to forecast a second outbreak. It can also explain the data from countries where the single-chain model deviates from the data.
Background: Identifying patient-specific flow of signal transduction perturbed by multiple single-nucleotide alterations is critical for improving patient outcomes in cancer cases. However, accurate estimation of mutational effects at the pathway level for such patients remains an open problem. While probabilistic pathway topology methods are gaining interest among the scientific community, the overwhelming majority do not account for network perturbation effects from multiple single-nucleotide alterations.
Methods: Here we present an improvement of the mutational forks formalism to infer the patient-specific flow of signal transduction based on multiple single-nucleotide alterations, including non-synonymous and synonymous mutations. The lung adenocarcinoma and skin cutaneous melanoma datasets from TCGA Pan-Cancer Atlas have been employed to show the utility of the proposed method.
Results: We have comprehensively characterized six mutational forks. The number of mutated nodes ranged from one to four depending on the topological characteristics of a fork. Transitional confidences (TCs) have been computed for every possible combination of single-nucleotide alterations in the fork. The analysis demonstrated that the mutational forks formalism follows a biologically explainable logic in identifying high-likelihood signaling routes in lung adenocarcinoma and skin cutaneous melanoma patients. The findings are largely supported by evidence from the biomedical literature.
Conclusion: We conclude that the formalism has strong potential to enable an assessment of patient-specific flow by leveraging information from multiple single-nucleotide alterations to adjust the transitional likelihoods that are solely based on the canonical view of a disease.
Background: With improvements in next-generation DNA sequencing technology, the cost of collecting genetic data has fallen. As a result, more machine learning techniques can be used to help with cancer analysis and diagnosis.
Methods: We developed an ensemble machine learning system, named the performance-weighted-voting model, for cancer type classification in 6,249 samples across 14 cancer types. Our ensemble system consists of five weak classifiers (logistic regression, SVM, random forest, XGBoost and neural networks). We first used cross-validation to obtain the predicted results for the five classifiers. The weights of the five weak classifiers are then obtained from their predictive performance by solving linear regression functions. The final predicted probability of the performance-weighted-voting model for a cancer type is the sum of each classifier’s weight multiplied by its predicted probability.
Results: Using the somatic mutation count of each gene as the input feature, the overall accuracy of the performance-weighted-voting model reached 71.46%, which was significantly higher than that of the five weak classifiers and of two other ensemble models: the hard-voting model and the soft-voting model. In addition, by analyzing the predictive pattern of the performance-weighted-voting model, we found that in most cancer types, higher tumor mutational burden can improve overall accuracy.
Conclusion: This study has important clinical significance for identifying the origin of cancer, especially for cancers whose primary site cannot be determined. In addition, our model presents a good strategy for using ensemble systems for cancer type classification.
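The weighting scheme in the Methods of this abstract (cross-validated predictions, weights from linear regression, final probability as a weighted sum) can be sketched as follows. This is one plausible reading, not the authors' implementation: it assumes a single weight per classifier, fitted by ordinary least squares against one-hot labels. The function names and the toy data (three classifiers, three classes) are hypothetical.

```python
import numpy as np


def fit_voting_weights(cv_probs, labels):
    """Estimate one weight per classifier by ordinary least squares.

    cv_probs: array (n_classifiers, n_samples, n_classes) of out-of-fold
              (cross-validated) predicted probabilities.
    labels:   integer class labels, shape (n_samples,).
    ASSUMPTION: a single global weight per classifier, regressed against
    one-hot labels; the abstract does not give the exact regression setup.
    """
    n_clf, _, n_classes = cv_probs.shape
    onehot = np.eye(n_classes)[labels]          # (n_samples, n_classes)
    design = cv_probs.reshape(n_clf, -1).T      # one column per classifier
    weights, *_ = np.linalg.lstsq(design, onehot.ravel(), rcond=None)
    return weights


def weighted_vote(probs, weights):
    """Final score per class: sum over classifiers of weight * probability."""
    return np.tensordot(weights, probs, axes=1)  # (n_samples, n_classes)


# Toy data (hypothetical): 3 classifiers, 6 samples, 3 cancer types.
rng = np.random.default_rng(0)
labels = np.array([0, 1, 2, 0, 1, 2])
perfect = np.eye(3)[labels]                      # a classifier that is always right
noisy = rng.dirichlet(np.ones(3), size=(2, 6))   # two uninformative classifiers
cv_probs = np.concatenate([perfect[None], noisy])

weights = fit_voting_weights(cv_probs, labels)
predicted = weighted_vote(cv_probs, weights).argmax(axis=1)
print("weights:", np.round(weights, 3))
print("predicted:", predicted)
```

In this toy setup the regression puts essentially all of the weight on the classifier that is always right. With real classifiers the weights would instead reflect how much each one contributes beyond the others.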
Background: Histone modifications are major factors that define chromatin states and regulate gene expression in eukaryotic cells. Chromatin immunoprecipitation coupled with high-throughput sequencing (ChIP-seq) has been widely used for profiling the genome-wide distribution of chromatin-associating protein factors. Some histone modifications, such as H3K27me3 and H3K9me3, usually mark broad domains in the genome ranging from kilobases (kb) to megabases (Mb) long, resulting in diffuse patterns in the ChIP-seq data that make signal separation challenging. Because most existing ChIP-seq peak-calling algorithms are based on local statistical models that do not account for multi-scale features, a principled method to identify scale-free broad domains has been lacking.
Methods: Here we present RECOGNICER (Recursive coarse-graining identification for ChIP-seq enriched regions), a computational method for identifying ChIP-seq enriched domains on a large range of scales. The algorithm is based on a coarse-graining approach, which uses recursive block transformations to determine spatial clustering of local enriched elements across multiple length scales.
Results: We apply RECOGNICER to call H3K27me3 domains from ChIP-seq data, and validate the results based on the association of H3K27me3 with repressed gene expression. We show that RECOGNICER outperforms existing ChIP-seq broad domain calling tools by identifying more whole domains rather than separated pieces.
Conclusion: RECOGNICER can be a useful bioinformatics tool for next-generation sequencing data analysis in epigenomics research.
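To make the idea of recursive block transformations in the Methods above more concrete, here is a toy sketch of coarse-graining on a one-dimensional track of enriched windows. It is not the RECOGNICER algorithm: the majority rule, the block size of 3, and all function names are assumptions chosen for illustration, in the spirit of a textbook block-spin transformation.

```python
import numpy as np


def block_transform(mask, block=3):
    """One coarse-graining step: a block is enriched if most of its windows are.

    ASSUMPTION: simple majority rule; the real method's rule may differ.
    """
    pad = (-len(mask)) % block                     # pad to a multiple of `block`
    padded = np.concatenate([mask, np.zeros(pad, dtype=bool)])
    counts = padded.reshape(-1, block).sum(axis=1)
    return counts * 2 > block


def enriched_runs(mask, scale, length):
    """Return half-open (start, end) runs of enriched blocks, in window units."""
    padded = np.concatenate([[False], mask, [False]]).astype(int)
    edges = np.flatnonzero(np.diff(padded))        # alternating starts and ends
    return [(int(s) * scale, min(int(e) * scale, length))
            for s, e in zip(edges[::2], edges[1::2])]


def coarse_grain_domains(mask, n_levels, block=3):
    """Domains found at level 0 (raw windows) and after each block transform."""
    mask = np.asarray(mask, dtype=bool)
    length, scale, levels = len(mask), 1, []
    for _ in range(n_levels + 1):
        levels.append(enriched_runs(mask, scale, length))
        mask = block_transform(mask, block)
        scale *= block
    return levels


# Hypothetical track: 18 windows, 1 = locally enriched.
track = [1, 1, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0]
levels = coarse_grain_domains(track, n_levels=2)
for level, domains in enumerate(levels):
    print(f"level {level}: {domains}")
```

At level 0 the nine leftmost windows appear as three separate pieces. After one block transformation they merge into a single domain spanning windows 0 to 9, while the isolated enriched window at position 13 is dropped as noise.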
Background: RNA secondary structures play a pivotal role in posttranscriptional regulation and the functions of non-coding RNAs, yet in vivo RNA secondary structures remain enigmatic. PARIS (Psoralen Analysis of RNA Interactions and Structures) is a recently developed high-throughput sequencing-based approach that enables direct capture of RNA duplex structures in vivo. However, the existence of incompatible, fuzzy pairing information obstructs the integration of PARIS data with the existing tools for reconstructing RNA secondary structure models at the single-base resolution.
Methods: We introduce IRIS, a method for predicting RNA secondary structure ensembles based on PARIS data. IRIS generates a large set of candidate RNA secondary structure models under the guidance of redistributed PARIS reads and then uses a Bayesian model to identify the optimal ensemble, according to both thermodynamic principles and PARIS data.
Results: The RNA structure ensembles predicted by IRIS have been verified using evolutionary conservation information and consistency with other experimental RNA structural data. IRIS is implemented in Python and freely available at http://iris.zhanglab.net.
Conclusion: IRIS capitalizes upon PARIS data to improve the prediction of in vivo RNA secondary structure ensembles. We expect that IRIS will enhance the application of the PARIS technology and provide more insight into in vivo RNA secondary structures.