The evolutionary scale modeling (ESM) series promises to revolutionize protein science and engineering through large language models (LLMs), providing a robust framework for understanding the relationships among the sequences, structures, and functions of proteins. Trained on a large number of unlabeled protein sequences, ESM models capture intricate patterns of mutation and conservation, yielding insights into the structural and functional properties of proteins. Despite a growing body of literature on ESM, existing surveys rarely give a focused, comprehensive account of its advancements and applications. This survey covers the latest developments of ESM, categorizing them into techniques for using ESM and downstream applications. Approximately 100 papers are selected and analyzed, highlighting recognized and innovative studies that exemplify the impact of ESM. Furthermore, we critically discuss the strengths and limitations of ESM to envision future applications. This review provides a valuable resource for researchers seeking to explore the power of ESM models and the emerging applications of LLMs in biology and medicine.
With the rapid advancements in large language model technology and the emergence of bioinformatics-specific language models (BioLMs), there is a growing need for a comprehensive analysis of their current landscape, computational characteristics, and diverse applications. This survey addresses this need by providing a thorough review of BioLMs, focusing on their evolution, classification, and distinguishing features, alongside a detailed examination of training methodologies, datasets, and evaluation frameworks. We explore the wide-ranging applications of BioLMs in critical areas such as disease diagnosis, drug discovery, and vaccine development, highlighting their impact and transformative potential in bioinformatics. We identify key challenges and limitations inherent in BioLMs, including data privacy and security concerns, interpretability issues, biases in training data and model outputs, and domain adaptation complexities. Finally, we highlight emerging trends and future directions, offering valuable insights to guide researchers and clinicians toward advancing BioLMs for increasingly sophisticated biological and clinical applications.
The detection of drug-drug interactions (DDIs) is crucial to the rational use of drug combinations. Experimental DDI detection is time-consuming and laborious, so researchers have developed a variety of computational methods to predict DDIs. Although many reviews have summarized these computational methods, they have focused on supervised learning. In this review, we provide a comprehensive and systematic summary of unsupervised (i.e., clustering) methods for DDI network analysis. Unlike previous studies, we highlight the unique advantages of clustering methods in DDI prediction and in uncovering mechanisms of action. We first introduce common types of drug information and discuss how to calculate drug similarity from them. Then, we introduce representative clustering algorithms (i.e., drug information-based and network-based methods) and describe clustering evaluation metrics. Finally, we discuss the limitations and challenges in this field and propose potential research directions. This review aims to promote further exploration and application of clustering methods in drug combination discovery and DDI network analysis.
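As a small illustration of the drug-similarity step mentioned above, the following sketch computes pairwise Jaccard similarity between drugs described by sets of features. The Jaccard index is one common choice and is an assumption here; the review itself may cover other measures. The drug names and target identifiers are hypothetical placeholders, not real data.

```python
def jaccard(a: set, b: set) -> float:
    """Jaccard similarity: size of the intersection over size of the union."""
    union = a | b
    if not union:
        return 0.0  # convention chosen here for two empty feature sets
    return len(a & b) / len(union)


def similarity_matrix(features: dict) -> dict:
    """Return {(drug_p, drug_q): similarity} for every ordered pair of drugs."""
    names = sorted(features)
    return {(p, q): jaccard(features[p], features[q]) for p in names for q in names}


# Hypothetical drugs, each described by a set of (made-up) target identifiers.
drug_targets = {
    "drug_A": {"T1", "T2", "T3"},
    "drug_B": {"T2", "T3", "T4"},
    "drug_C": {"T7"},
}

sim = similarity_matrix(drug_targets)
for pair, value in sorted(sim.items()):
    print(pair, round(value, 3))
```

A matrix of this kind (or the corresponding distance, one minus the similarity) is the typical input to a drug information-based clustering algorithm.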
Drug-perturbed transcriptomes are important for personalized medicine and drug discovery. Nevertheless, existing high-throughput screening and sequencing techniques for drug-perturbed transcriptomes remain expensive and time-consuming. In this study, we propose a novel multi-condition diffusion transformer model, the perturbation diffusion transformer (PertDiT), which is tailored to conditionally generating perturbed transcriptomes from drug text information. PertDiT combines the transformer architecture with text representations from pre-trained large language models and uses novel perturbation and transcriptome fusion modules. We have designed two network structures, CrossDiT and CatCrossDiT, applicable to drug discovery and personalized medicine scenarios, respectively. Evaluated with a comprehensive set of metrics and an effective data splitting strategy, our model outperforms existing methods, demonstrating a superior ability in post-perturbation transcriptome reconstruction and in the prediction of perturbation-induced transcriptional changes. The rationality and effectiveness of the model structure have also been validated.
Lineage tracing techniques have developed rapidly in the past decades through new genetic engineering tools. However, because of their invasive nature, these techniques are difficult to apply to humans. Although endogenous DNA mutations can be used for in vivo lineage tracing in humans, their extremely low mutation rate presents substantial technical challenges. In contrast, epimutations in DNA methylation occur at a rate of about 0.001 per CpG site per division. Such rich and stable information enables high-resolution, noninvasive lineage tracing in humans, as recently achieved with both MethylTree and EPI-Clone. MethylTree is a computational method that accurately predicts cell lineages from single-cell DNA methylation data, whether genome-wide or targeted. EPI-Clone is a targeted approach that requires careful CpG panel selection for specific tissues and has been validated in blood. In this review, we present an overview of related historical studies, discuss the development of both MethylTree and EPI-Clone, and compare the two approaches. Although EPI-Clone is more scalable and cheaper, MethylTree has a higher resolution and works directly across different tissues. We demonstrate here that MethylTree also works well with EPI-Clone data, thus providing a unified solution for epimutation-based lineage tracing. Finally, we highlight the advantages of epimutation-based lineage tracing, discuss future directions for tool development, and touch on considerations in biological applications. Epimutation-based lineage tracing opens up an exciting avenue for noninvasive lineage tracing in humans across many biological processes.
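To give a feel for what a rate of about 0.001 per CpG site per division implies, the back-of-the-envelope sketch below estimates how many epimutation events to expect per division and what fraction of sites has changed at least once after a number of divisions. It assumes that sites are independent and that the rate is constant; these are simplifying assumptions made here, not claims from the review. The site count and division number are arbitrary illustrative values.

```python
RATE = 0.001  # epimutations per CpG site per division (figure quoted above)


def expected_events_per_division(n_sites: int, rate: float = RATE) -> float:
    """Expected number of epimutation events in one division across n_sites."""
    return n_sites * rate


def fraction_sites_changed(divisions: int, rate: float = RATE) -> float:
    """Probability that a given site has been hit at least once after `divisions`
    divisions, assuming independent events at a constant rate."""
    return 1.0 - (1.0 - rate) ** divisions


# Illustrative values only: 10,000 profiled CpG sites and 100 divisions.
print(expected_events_per_division(10_000))
print(round(fraction_sites_changed(100), 4))
```

Under these assumptions, even a modest panel of CpG sites accumulates new marks at nearly every division, which is the intuition behind the high resolution attributed to epimutation-based tracing.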
Lineage tracing using endogenous mitochondrial DNA (mtDNA) variants holds great promise for reconstructing the lineage histories of individual cells, with broad applications in oncology, developmental biology, and regenerative medicine. Unlike synthetic DNA barcoding techniques, mitochondrial lineage tracing does not require genetic engineering of exogenous genetic markers, and thus is particularly suitable for human clinical samples. Various experimental and computational methods have been developed to profile mtDNA variants from single-cell genomic, transcriptomic, and epigenomic sequencing data. Despite the technical advances, several challenges still limit the robustness of single-cell mitochondrial lineage tracing, such as random genetic drift, genetic bottlenecks, informative variant identification, and low mtDNA coverage. In this review, we systematically examine current experimental and computational approaches for analyzing mtDNA variants in single cells and discuss current challenges and future technical developments aimed at enhancing the robustness and applicability of single-cell mitochondrial lineage tracing.
Substrate inhibition in lactic acid bacteria (LAB) fermentation occurs when the substrate concentration exceeds a critical value, leading to reduced cell growth and thus inefficient lactic acid production. Many efforts, including experiments and kinetic models, have been devoted to elucidating the possible mechanisms of substrate inhibition. However, the molecular and physiological basis of this phenomenon remains incompletely characterized. In this study, we propose a mechanistic two-pathway model that integrates a substrate-responsive molecular regulatory pathway into the typical substrate assimilation and microbial growth pathway. With a single set of parameters, the model captures the global growth dynamics, including lag, exponential, and stationary phases, over a wide range of initial substrate concentrations. In particular, the results exhibit a significantly prolonged lag phase at high initial substrate concentrations. We test this model framework against published experimental data on batch fermentation by LAB such as Lactobacillus bulgaricus, Lactobacillus casei, and Lactiplantibacillus plantarum on lactose, demonstrating its universality beyond specific substrate-strain systems. Furthermore, the model simulations show that an appropriate preculture treatment that modulates the physiological state of the inoculum could be a possible approach to coping with substrate inhibition in high-substrate environments. Finally, the model predictions of optimal microbial growth dynamics are investigated for various inoculum sizes. The proposed modeling approach provides novel insights into the connection between microbial fermentation and substrate supply, facilitating efficient substrate utilization in bioprocess engineering.
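For readers unfamiliar with substrate inhibition, the sketch below shows the classical Haldane (Andrews) kinetic expression, a standard empirical description in which the specific growth rate rises with substrate concentration and then falls beyond an optimum. This is not the two-pathway model proposed above; it is shown only as a point of reference for the kind of kinetic model that work builds on. All parameter values are arbitrary and chosen for illustration.

```python
import math


def haldane_growth_rate(s: float, mu_max: float, k_s: float, k_i: float) -> float:
    """Haldane/Andrews specific growth rate: mu_max * S / (K_s + S + S^2 / K_i)."""
    return mu_max * s / (k_s + s + s * s / k_i)


def optimal_substrate(k_s: float, k_i: float) -> float:
    """Substrate concentration that maximizes the Haldane rate: sqrt(K_s * K_i)."""
    return math.sqrt(k_s * k_i)


# Arbitrary illustrative parameters (not fitted to any LAB data set).
MU_MAX, K_S, K_I = 1.0, 1.0, 100.0

s_opt = optimal_substrate(K_S, K_I)
for s in (1.0, 5.0, s_opt, 20.0, 50.0):
    print(f"S = {s:5.1f}  mu = {haldane_growth_rate(s, MU_MAX, K_S, K_I):.3f}")
```

The Haldane form describes a reduced growth rate at high substrate levels but does not by itself produce a substrate-dependent lag phase, which is the behavior the mechanistic model above is reported to capture.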
Precise prediction of drug-drug interactions (DDIs) is essential for pharmaceutical research and clinical applications to minimize adverse reactions, optimize therapies, and reduce costs. However, existing methods still face challenges in effectively integrating multidimensional drug features and in fully utilizing edge features in molecular graphs, both of which are crucial for predicting DDIs precisely. Moreover, current methods may not adequately capture the complex relationships between different types of features, limiting predictive performance. This paper proposes the MFCN-DDI model for DDI type prediction. The model consists of a multimodal feature extraction module, a capsule network-based feature fusion module, and a DDI predictor module. In the multimodal feature extraction module, four kinds of features are used to provide rich and comprehensive representations for subsequent DDI type prediction, where molecular graph features are generated from molecular graphs with edge features. The capsule network-based feature fusion module captures complex feature relationships to generate high-quality integrated representations. The DDI predictor module performs multiclass and multilabel classification. Experimental results show that MFCN-DDI outperforms the compared models in these prediction tasks, and case studies further support its practical applicability. In summary, MFCN-DDI provides an efficient and reliable solution for DDI prediction.
Predicting drug-target affinity (DTA) is critical for discovering and developing hepatoprotective agents that can prevent and treat liver diseases. In this study, we propose BiGraph-DTA, a new model for predicting DTA scores of hepatoprotective compounds that combines graph convolutional networks and bidirectional long short-term memory networks. The model processes both graph representations of molecular structures and sequential information from protein sequences to capture complex dependencies and interactions. On a curated hepatoprotective dataset (from ChEMBL) consisting of 21,421 interactions, the model outperforms traditional machine learning methods (such as random forest and XGBoost) as well as other deep learning methods (such as DeepDTA and GraphDTA) in predictive performance. BiGraph-DTA achieved the best mean squared error of 0.7885, R² of 0.7208, and concordance index of 0.8508. The proposed architecture holds potential for accelerating hepatoprotective drug discovery by providing a data-driven framework through which candidate drugs and their corresponding protein targets can be identified. The model therefore offers a new opportunity for discovering hepatoprotective compounds and may help speed up the search for new liver disease drugs.
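The three evaluation metrics reported above can be computed as in the following minimal sketch. The definitions are the standard ones (mean squared error, coefficient of determination, and a pairwise concordance index in which prediction ties count as half); whether the study handles ties or incomparable pairs in exactly this way is an assumption. The affinity values are made up for illustration and are unrelated to the reported results.

```python
from itertools import combinations


def mse(y_true, y_pred):
    """Mean squared error between true and predicted affinities."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def concordance_index(y_true, y_pred):
    """Fraction of pairs with different true values whose predicted order
    matches the true order; tied predictions count as 0.5."""
    concordant, comparable = 0.0, 0
    for (t1, p1), (t2, p2) in combinations(zip(y_true, y_pred), 2):
        if t1 == t2:
            continue  # pairs with equal true affinity are not comparable
        comparable += 1
        if p1 == p2:
            concordant += 0.5
        elif (p1 < p2) == (t1 < t2):
            concordant += 1.0
    if comparable == 0:
        raise ValueError("no comparable pairs: all true values are equal")
    return concordant / comparable


# Made-up affinities for illustration only.
y_true = [5.0, 6.0, 7.0, 8.0]
y_pred = [5.5, 5.8, 7.2, 7.9]

print(mse(y_true, y_pred), r2(y_true, y_pred), concordance_index(y_true, y_pred))
```

Lower is better for mean squared error, while higher is better for R² and the concordance index; the latter depends only on the ranking of the predictions, not on their scale.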