Drug discovery aims to design novel molecules with specific chemical properties for the treatment of targeted diseases. Molecular optimization, which improves the physical and chemical properties of a molecule, is an important step in drug discovery. Artificial intelligence techniques have recently shown excellent success in drug discovery and have emerged as a new strategy to address the challenges of drug design, including molecular optimization, and to drastically reduce the cost and time of drug discovery. We review the latest advances in molecular optimization in artificial intelligence-based drug discovery, including data resources, molecular properties, optimization methodologies, and assessment criteria. Specifically, we classify the optimization methodologies into molecular mapping-based, molecular distribution matching-based, and guided search-based methods, and discuss the principles of these methods as well as their pros and cons. Moreover, we highlight the current challenges in molecular optimization and offer a variety of perspectives, including interpretability, multidimensional optimization, and model generalization, on potential new lines of research to pursue in the future. This study provides a comprehensive review of molecular optimization in artificial intelligence-based drug discovery and points out both the challenges and the new prospects. This review will guide researchers who are interested in artificial intelligence-based molecular optimization.
Developing methylotrophic cell factories that can efficiently convert organic one-carbon (C1) feedstocks derived from electrocatalytic reduction of carbon dioxide into bio-based chemicals and biofuels is of strategic significance for building a carbon-neutral, sustainable economic and industrial system. With the rapid advancement of RNA sequencing technology and mass spectrometry analysis, researchers have used these quantitative microbiology methods extensively, especially isotope-based metabolic flux analysis, to study the metabolic processes starting from C1 feedstocks in natural C1-utilizing bacteria and synthetic C1 bacteria. This paper reviews the recent use of advanced quantitative analysis to understand the metabolic network and basic principles of metabolism in natural C1-utilizing bacteria grown on methane, methanol, or formate. The acquired knowledge serves as a guide to rewire the central methylotrophic metabolism of natural C1-utilizing bacteria to improve the carbon conversion efficiency, and to engineer non-C1-utilizing bacteria into synthetic strains that can use C1 feedstocks as the sole carbon and energy source. This progress ultimately enhances the design and construction of highly efficient C1-based cell factories to synthesize diverse high value-added products. The integration of quantitative biology and synthetic biology will advance the iterative cycle of understand–design–build–test–learn to enhance C1-based biomanufacturing in the future.
The prediction of drug-drug interactions (DDIs) is a crucial task for drug safety research, and identifying potential DDIs helps us to explore the mechanism behind combinatorial therapy. Traditional wet-lab experiments for DDIs are cumbersome, time-consuming, and too small in scale, which limits the efficiency of DDI prediction. Therefore, it is particularly crucial to develop improved computational methods for detecting drug interactions. With the development of deep learning, several deep learning-based computational models have been proposed for DDI prediction. In this review, we summarize the high-quality deep learning-based DDI prediction methods of recent years and divide them into four categories: neural network-based methods, graph neural network-based methods, knowledge graph-based methods, and multimodal-based methods. Furthermore, we discuss the challenges of existing methods and potential future perspectives. This review reveals that deep learning can significantly improve DDI prediction performance compared to traditional machine learning. Deep learning models can scale to large datasets and accept multiple data types as input, thus making DDI predictions more efficient and accurate.
Recent advances in single-cell chromatin accessibility sequencing (scCAS) technologies have resulted in new insights into the characterization of epigenomic heterogeneity and have increased the need for automatic cell type annotation. However, existing automatic annotation methods for scCAS data fail to incorporate the reference data and neglect novel cell types, which only exist in a test set. Here, we propose RAINBOW, a reference-guided automatic annotation method based on the contrastive learning framework, which is capable of effectively identifying novel cell types in a test set. By utilizing contrastive learning and incorporating reference data, RAINBOW can effectively characterize the heterogeneity of cell types, thereby facilitating more accurate annotation. With extensive experiments on multiple scCAS datasets, we show the advantages of RAINBOW over state-of-the-art methods in known and novel cell type annotation. We also verify the effectiveness of incorporating reference data during the training process. In addition, we demonstrate the robustness of RAINBOW to data sparsity and number of cell types. Furthermore, RAINBOW provides superior performance in newly sequenced data and can reveal biological implication in downstream analyses. All the results demonstrate the superior performance of RAINBOW in cell type annotation for scCAS data. We anticipate that RAINBOW will offer essential guidance and great assistance in scCAS data analysis. The source codes are available at the GitHub website (BioX-NKU/RAINBOW).
Cardiovascular disease (CVD) is the major cause of death in many regions around the world, and several of its risk factors might be linked to diets. To improve public health and the understanding of this topic, we look at the recent Minnesota Coronary Experiment (MCE) analysis that used t-test and Cox model to evaluate CVD risks. However, these parametric methods might suffer from three problems: small sample size, right-censored bias, and lack of long-term evidence. To overcome the first of these challenges, we utilize a nonparametric permutation test to examine the relationship between dietary fats and serum total cholesterol. To address the second problem, we use a resampling-based rank test to examine whether the serum total cholesterol level affects CVD deaths. For the third issue, we use some extra-Framingham Heart Study (FHS) data with an A/B test to look for meta-relationship between diets, risk factors, and CVD risks. We show that, firstly, the link between low saturated fat diets and reduction in serum total cholesterol is strong. Secondly, reducing serum total cholesterol does not robustly have an impact on CVD hazards in the diet group. Lastly, the A/B test result suggests a more complicated relationship regarding abnormal diastolic blood pressure ranges caused by diets and how these might affect the associative link between the cholesterol level and heart disease risks. This study not only helps us to deeply analyze the MCE data but also, in combination with the long-term FHS data, reveals possible complex relationships behind diets, risk factors, and heart disease.
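The nonparametric permutation test mentioned in the abstract above can be illustrated with a minimal sketch. It assumes a two-sided test on the difference in group means, which the abstract does not specify; the function name `permutation_test`, its parameters, and the toy cholesterol values are hypothetical and are not taken from the MCE analysis.

```python
import random


def permutation_test(group_a, group_b, n_permutations=10000, seed=0):
    """Two-sided permutation test on the difference in group means.

    Hypothetical helper for illustration; the statistic used in the
    original study is not stated in the abstract.
    """
    rng = random.Random(seed)
    n_a = len(group_a)
    n_b = len(group_b)
    observed = abs(sum(group_a) / n_a - sum(group_b) / n_b)

    pooled = list(group_a) + list(group_b)
    extreme = 0
    for _ in range(n_permutations):
        # Randomly reassign group labels by shuffling the pooled values.
        rng.shuffle(pooled)
        diff = abs(sum(pooled[:n_a]) / n_a - sum(pooled[n_a:]) / n_b)
        # Small tolerance guards against floating-point ties.
        if diff >= observed - 1e-12:
            extreme += 1

    # Add-one correction so the estimated p-value is never exactly zero.
    return (extreme + 1) / (n_permutations + 1)


if __name__ == "__main__":
    # Made-up serum cholesterol values (mg/dL) for a diet and a control group.
    diet = [182, 175, 190, 168, 171, 179]
    control = [205, 198, 214, 201, 193, 209]
    print("estimated p-value:", permutation_test(diet, control))
```

Because the test only relies on exchangeability of the group labels under the null hypothesis, it does not need the normality assumption of a t-test, which is the usual motivation for using it with small samples.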
Complicated molecular alterations in tumors generate various mutant peptides. Some of these mutant peptides can be presented to the cell surface and then elicit immune responses, and such mutant peptides are called neoantigens. Accurate detection of neoantigens could help to design personalized cancer vaccines. Although some computational frameworks for neoantigen detection have been proposed, most of them can only detect SNV- and indel-derived neoantigens. In addition, current frameworks adopt oversimplified neoantigen prioritization strategies. These factors hinder the comprehensive and effective detection of neoantigens. We developed NeoHunter, flexible software to systematically detect and prioritize neoantigens from sequencing data in different formats. NeoHunter can detect not only SNV- and indel-derived neoantigens but also gene fusion- and aberrant splicing-derived neoantigens. NeoHunter supports both direct and indirect immunogenicity evaluation strategies to prioritize candidate neoantigens. These strategies utilize binding characteristics, existing biological big data, and T-cell receptor specificity to ensure accurate detection and prioritization. We applied NeoHunter to the TESLA dataset, cohorts of melanoma and non-small cell lung cancer patients. NeoHunter achieved high performance across the TESLA cancer patients and detected 79% (27 out of 34) of validated neoantigens in total. SNV- and indel-derived neoantigens accounted for 90% of the top 100 candidate neoantigens while neoantigens from aberrant splicing accounted for 9%. Gene fusion-derived neoantigens were detected in one patient. NeoHunter is a powerful tool to ‘catch all’ neoantigens and is available for free academic use on Github (XuegongLab/NeoHunter).
To investigate the impact of hyperglycemia on the prognosis of patients with gastric cancer and identify key molecules associated with high glucose levels in gastric cancer development, RNA sequencing data and clinical features of gastric cancer patients were obtained from The Cancer Genome Atlas (TCGA) database. High glucose-related genes strongly associated with gastric cancer were identified using weighted gene co-expression network and differential analyses. A gastric cancer prognosis signature was constructed based on these genes and patients were categorized into high- and low-risk groups. The immune statuses of the two patient groups were compared. ATP citrate lyase (ACLY), a gene significantly related to the prognosis, was found to be upregulated upon high-glucose stimulation. Immunohistochemistry and molecular analyses confirmed high ACLY expression in gastric cancer tissues and cells. Gene Set Enrichment Analysis (GSEA) revealed the involvement of ACLY in cell cycle and DNA replication processes. Inhibition of ACLY affected the proliferation and migration of gastric cancer cells induced by high glucose levels. These findings suggest that ACLY, as a high glucose-related gene, plays a critical role in gastric cancer progression.
Protein biomarkers represent specific biological activities and processes, so they have had a critical role in cancer diagnosis and medical care for more than 50 years. With the recent improvement in proteomics technologies, thousands of protein biomarker candidates have been developed for diverse disease states. Studies have used different types of samples for proteomics diagnosis. Samples were pretreated with appropriate techniques to increase the selectivity and sensitivity of the downstream analysis and purified to remove contaminants. The purified samples were analyzed by several principal proteomics techniques to identify the specific protein. In this study, recent improvements in protein biomarker discovery, verification, and validation are investigated. Furthermore, the advantages and disadvantages of conventional techniques are discussed. Studies have used mass spectrometry (MS) as a critical technique in the identification and quantification of candidate biomarkers. Nevertheless, after protein biomarker discovery, verification and validation in a larger number of samples are required to reduce the false-positive rate. Multiple reaction monitoring (MRM), parallel reaction monitoring (PRM), and selected reaction monitoring (SRM), in combination with stable isotope-labeled internal standards, have been examined as options for biomarker verification, and enzyme-linked immunosorbent assay (ELISA) for validation.
Although the principles of synthetic biology were initially established in model bacteria, microbial producers, extremophiles and gut microbes have now emerged as valuable prokaryotic chassis for biological engineering. Extending the host range in which designed circuits can function reliably and predictably presents a major challenge for the concept of synthetic biology to materialize. In this work, we systematically characterized the cross-species universality of two transcriptional regulatory modules—the T7 RNA polymerase activator module and the repressors module—in three non-model microbes. We found striking linear relationships in circuit activities among different organisms for both modules. Parametrized model fitting revealed host non-specific parameters defining the universality of both modules. Lastly, a genetic NOT gate and a band-pass filter circuit were constructed from these modules and tested in non-model organisms. Combined models employing host non-specific parameters were successful in quantitatively predicting circuit behaviors, underscoring the potential of universal biological parts and predictive modeling in synthetic bioengineering.
Transformer‐based foundation models such as ChatGPT have revolutionized our daily lives and affected many fields, including bioinformatics. In this perspective, we first discuss the direct application of textual foundation models to bioinformatics tasks, focusing on how to make the most of canonical large language models and mitigate their inherent flaws. We then go through the transformer‐based, bioinformatics‐tailored foundation models for both sequence and non‐sequence data. In particular, we envision the further development directions as well as challenges for bioinformatics foundation models.
Drug-drug interaction (DDI) event prediction is a challenging problem, and accurate prediction of DDI events is critical to patient health and new drug development. Recently, many machine learning-based techniques have been proposed for predicting DDI events. However, most of the existing methods do not effectively integrate the multidimensional features of drugs and mitigate noise poorly, which makes it hard to obtain effective feature information. To address these limitations, we propose a DDI-Transform neural network framework for DDI event prediction. In DDI-Transform, we design a drug structure information feature extraction module and a drug bind-protein feature extraction module to obtain multidimensional feature information. A stack of DDI-Transform layers in the DDI-Transform network module is then used to adaptively select the effective feature information for prediction. The results show that DDI-Transform can accurately predict DDI events and outperforms the state-of-the-art models. Results on datasets of different scales confirm the robustness of the method.
The prediction of the interaction between a drug and a target is the most critical issue in the fields of drug development and repurposing. However, there are still two challenges in current deep learning research: (i) the structural information of drug molecules is not fully explored in most drug target studies, and the previous drug SMILES does not correspond well to effective drug molecules and (ii) exploration of the potential relationship between drugs and targets is in need of improvement. In this work, we use a new and better representation of the effective molecular graph structure, SELFIES. We propose a hybrid mechanism framework based on convolutional neural network and graph attention network to capture multi-view feature information of drug and target molecular structures, and we aim to enhance the ability to capture interaction sites between a drug and a target. In this study, our experiments using two different datasets show that the GCARDTI model outperforms a variety of different model algorithms on different metrics. We also demonstrate the accuracy of our model through two case studies.
Transfer learning has revolutionized fields including natural language understanding and computer vision by leveraging large‐scale general datasets to pretrain models with foundational knowledge that can then be transferred to improve predictions in a vast range of downstream tasks. More recently, there has been a growth in the adoption of transfer learning approaches in biological fields, where models have been pretrained on massive amounts of biological data and employed to make predictions in a broad range of biological applications. However, unlike in natural language where humans are best suited to evaluate models given a clear understanding of the ground truth, biology presents the unique challenge of being in a setting where there are a plethora of unknowns while at the same time needing to abide by real‐world physical constraints. This perspective provides a discussion of some key points we should consider as a field in designing benchmarks for foundation models in network biology.
Combination therapy is a promising approach to address the challenge of antimicrobial resistance, and computational models have been proposed for predicting drug–drug interactions. Most existing models rely on drug similarity measures based on characteristics such as chemical structure and the mechanism of action. In this study, we focus on the network structure itself and propose a drug similarity measure based on drug–drug interaction networks. We explore the potential applications of this measure by combining it with unsupervised learning and semi-supervised learning approaches. In unsupervised learning, drugs can be grouped based on their interactions, leading to almost monochromatic group–group interactions. In addition, drugs within the same group tend to have similar mechanisms of action (MoA). In semi-supervised learning, the similarity measure can be utilized to construct affinity matrices, enabling the prediction of unknown drug–drug interactions. Our method exceeds existing approaches in terms of performance. Overall, our experiments demonstrate the effectiveness and practicability of the proposed similarity measure. On the one hand, when combined with clustering algorithms, it can be used for functional annotation of compounds with unknown MoA. On the other hand, when combined with semi-supervised graph learning, it enables the prediction of unknown drug–drug interactions.
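As a rough illustration of a similarity measure derived from the interaction network itself, the sketch below compares two drugs by the cosine similarity of their interaction profiles against all other drugs. This is an assumption made for illustration only: the abstract above does not define the proposed measure, and the encoding (+1 synergy, -1 antagonism, 0 none or unknown) and the function name `interaction_similarity` are hypothetical.

```python
import math


def interaction_similarity(matrix, i, j):
    """Cosine similarity between the interaction profiles of drugs i and j.

    `matrix` is a symmetric list of lists with +1 (synergy), -1 (antagonism)
    or 0 (no or unknown interaction). Entries involving i or j themselves
    are excluded so the pair's own interaction does not bias the score.
    Hypothetical measure, not the one defined in the paper.
    """
    others = [k for k in range(len(matrix)) if k not in (i, j)]
    a = [matrix[i][k] for k in others]
    b = [matrix[j][k] for k in others]
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    if norm == 0:
        return 0.0  # at least one drug has no recorded interactions
    return sum(x * y for x, y in zip(a, b)) / norm


if __name__ == "__main__":
    # Toy network of four drugs: drugs 0 and 1 interact identically with 2 and 3.
    M = [
        [0, 0, 1, -1],
        [0, 0, 1, -1],
        [1, 1, 0, 0],
        [-1, -1, 0, 0],
    ]
    print(interaction_similarity(M, 0, 1))
    print(interaction_similarity(M, 2, 3))
```

A pairwise matrix of such scores could serve as the input to a clustering algorithm or as an affinity matrix for semi-supervised graph learning, which are the two uses the abstract describes.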
The identification of tumor driver genes facilitates accurate cancer diagnosis and treatment, playing a key role in precision oncology, along with gene signaling, regulation, and their interaction with protein complexes. To tackle the challenge of distinguishing driver genes from a large amount of genomic data, we construct a feature extraction framework for discovering pan-cancer driver genes based on multi-omics data (mutations, gene expression, copy number variants, and DNA methylation) combined with protein–protein interaction (PPI) networks. Using a network propagation algorithm, we mine functional information among nodes in the PPI network, focusing on genes with weak node information to represent specific cancer information. From these functional features, we extract three groups of pan-cancer features: distribution features of the pan-cancer data, TOPSIS features obtained with the ideal solution method, and SetExpan features, which rank the pan-cancer data based on the average inverse rank. These features represent information shared across cancer types. Finally, we use the LightGBM classification algorithm for gene prediction. Experimental results show that our method outperforms existing methods in terms of the area under the precision-recall curve (AUPRC) and demonstrates better performance across different PPI networks. This indicates our framework’s effectiveness in predicting potential cancer genes, offering valuable insights for the diagnosis and treatment of tumors.
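The ideal solution method (TOPSIS) named in the abstract above ranks items by their closeness to an ideal best point and their distance from an ideal worst point. The sketch below shows the standard TOPSIS scoring steps on a toy gene-by-feature table. It assumes every column is a benefit criterion (larger is better) and uses equal weights by default; how the paper builds its TOPSIS features from the functional features is not stated in the abstract, so `topsis_scores` and the toy values are hypothetical.

```python
import math


def topsis_scores(rows, weights=None):
    """Standard TOPSIS closeness scores, one per row.

    Assumes all columns are benefit criteria. Illustrative helper only,
    not the feature construction used in the paper.
    """
    n_cols = len(rows[0])
    if weights is None:
        weights = [1.0 / n_cols] * n_cols

    # Vector-normalize each column, then apply the column weight.
    norms = [math.sqrt(sum(r[c] ** 2 for r in rows)) or 1.0 for c in range(n_cols)]
    scaled = [[weights[c] * r[c] / norms[c] for c in range(n_cols)] for r in rows]

    # Ideal best and worst points, taken column by column.
    best = [max(r[c] for r in scaled) for c in range(n_cols)]
    worst = [min(r[c] for r in scaled) for c in range(n_cols)]

    scores = []
    for r in scaled:
        d_best = math.dist(r, best)
        d_worst = math.dist(r, worst)
        total = d_best + d_worst
        # Closeness is 1 at the ideal best point and 0 at the ideal worst point.
        scores.append(d_worst / total if total else 0.0)
    return scores


if __name__ == "__main__":
    # Rows are genes, columns are made-up propagated scores for two cancer types.
    genes = [[0.9, 0.8], [0.1, 0.2], [0.5, 0.5]]
    print(topsis_scores(genes))
```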
Colorectal cancer (CRC) is one of the most common cancers. Patients with advanced CRC can only rely on chemotherapy to improve outcomes. However, primary drug resistance frequently occurs and is difficult to predict. Changes in plasma protein composition have shown potential in clinical diagnosis. Thus, there is an urgent need to identify potential protein biomarkers of primary resistance to chemotherapy in patients with CRC. Automatic sample preparation and high-throughput analysis were used to explore potential plasma protein biomarkers. Drug susceptibility testing of circulating tumor cells (CTCs) was investigated, and the relationship between its results and protein expression was examined. In addition, the differential proteins across chemotherapy outcomes were analyzed. Finally, the potential biomarkers were measured via enzyme-linked immunosorbent assay (ELISA). The plasma proteomes of 60 CRC patients were profiled. Plasma protein levels were correlated with the results of drug susceptibility testing of CTCs, and 85 proteins showed a significant positive or negative correlation with chemotherapy resistance. Forty-four CRC patients were then divided into three groups according to their chemotherapy outcomes (objective response, stable disease, and progressive disease), and 37 differential proteins were found to be related to chemotherapy resistance. The overlapping proteins were further investigated in an additional group of 79 patients using ELISA. Protein levels of F5 and PROZ were significantly increased in the progressive disease group compared with the other outcome groups. Our study indicates that F5 and PROZ proteins could represent potential biomarkers of resistance to chemotherapy in advanced CRC patients.
Effective clinical trials are necessary for understanding medical advances but early termination of trials can result in unnecessary waste of resources. Survival models can be used to predict survival probabilities in such trials. However, survival data from clinical trials are sparse, and DeepSurv cannot accurately capture their effective features, making the models weak in generalization and decreasing their prediction accuracy. In this paper, we propose a survival prediction model for clinical trial completion based on the combination of denoising autoencoder (DAE) and DeepSurv models. The DAE is used to obtain a robust representation of features by breaking the loop of raw features after autoencoder training, and then the robust features are provided to DeepSurv as input for training. The clinical trial dataset for training the model was obtained from the
Copy number variation (CNV) refers to the number of copies of a specific sequence in a genome and is a type of chromatin structural variation. The development of the Hi‐C technique has empowered research on the spatial structure of chromatins by capturing interactions between DNA fragments. We utilized machine‐learning methods including the linear transformation model and graph convolutional network (GCN) to detect CNV events from Hi‐C data and reveal how CNV is related to three‐dimensional interactions between genomic fragments in terms of the one‐dimensional read count signal and features of the chromatin structure. The experimental results demonstrated a specific linear relation between the Hi‐C read count and CNV for each chromosome that can be well qualified by the linear transformation model. In addition, the GCN‐based model could accurately extract features of the spatial structure from Hi‐C data and infer the corresponding CNV across different chromosomes in a cancer cell line. We performed a series of experiments including dimension reduction, transfer learning, and Hi‐C data perturbation to comprehensively evaluate the utility and robustness of the GCN‐based model. This work can provide a benchmark for using machine learning to infer CNV from Hi‐C data and serves as a necessary foundation for deeper understanding of the relationship between Hi‐C data and CNV.
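The per-chromosome linear relation between Hi-C read count and copy number described above can be pictured with an ordinary least-squares fit. The sketch below is an assumption for illustration: the abstract does not give the form of the linear transformation model, and `fit_line`, `infer_copy_number`, and the toy values are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of ys = slope * xs + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def infer_copy_number(read_count, slope, intercept):
    """Invert the fitted line to estimate copy number from a read count."""
    return (read_count - intercept) / slope


if __name__ == "__main__":
    # Toy bins from one chromosome: known copy numbers and their read counts.
    copy_numbers = [1, 2, 3, 4]
    read_counts = [50, 100, 150, 200]
    slope, intercept = fit_line(copy_numbers, read_counts)
    print(infer_copy_number(125, slope, intercept))
```

Fitting one line per chromosome, rather than one line for the whole genome, mirrors the abstract's statement that the relation is specific to each chromosome.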
Caenorhabditis elegans has been widely used as a model organism in developmental biology due to its invariant development. In this study, we developed a desktop software tool, CShaperApp, to segment fluorescence‐labeled images of cell membranes and analyze cellular morphologies interactively during C. elegans embryogenesis. Based on the previously proposed framework CShaper, CShaperApp empowers biologists to automatically and efficiently extract quantitative cellular morphological data with either an existing deep learning model or a fine‐tuned one adapted to their in‐house dataset. Experimental results show that it takes about 30 min to process a three‐dimensional time‐lapse (4D) dataset, which consists of 150 image stacks at a ~1.5‐min interval and covers C. elegans embryogenesis from the 4‐cell to 350‐cell stages. The robustness of CShaperApp is also validated with datasets from different laboratories. Furthermore, its modularized implementation increases flexibility in multi‐task applications and facilitates future enhancements. As cell morphology over development has emerged as a focus of interest in developmental biology, CShaperApp is anticipated to pave the way for those studies by accelerating the high‐throughput generation of systems‐level quantitative data. The software can be freely downloaded from GitHub (cao13jf/CShaperApp) and is executable on Windows, macOS, and Linux operating systems.
Molecular subtyping of gastric cancer (GC) aims to comprehend its genetic landscape. However, the efficacy of current subtyping methods is hampered by their mixed use of molecular features, a lack of strategy optimization, and the limited availability of public GC datasets. There is a pressing need for a precise and easily adoptable subtyping approach for early DNA-based screening and treatment. Based on TCGA subtypes, we developed a novel DNA-based hierarchical classifier for gastric cancer molecular subtyping (HCG), which employs gene mutations, copy number aberrations, and methylation patterns as predictors. By incorporating the closely related esophageal adenocarcinoma dataset, we expanded the TCGA GC dataset for the training and testing of HCG (n = 453). HCG was optimized through three hierarchical strategies using Lasso-Logistic regression, which were evaluated by their overall area under the receiver operating characteristic curve (auROC), accuracy, F1 score, and area under the precision-recall curve (auPRC), and by their capability for clinical stratification using multivariate survival analysis. Subtype-specific DNA alteration biomarkers were identified through difference tests based on HCG-defined subtypes. Our HCG classifier demonstrated superior performance in terms of overall auROC (0.95), accuracy (0.88), F1 score (0.87), and auPRC (0.86), significantly improving the clinical stratification of patients (overall p-value = 0.032). Difference tests identified 25 subtype-specific DNA alterations, including a high mutation rate in the SYNE1, ITGB4, and COL22A1 genes for the MSI subtype, and hypermethylation of the ALS2CL, KIAA0406, and RPRD1B genes for the EBV subtype. HCG is an accurate and robust classifier for DNA-based GC molecular subtyping with highly predictive clinical stratification performance. The training and test datasets, along with the analysis programs of HCG, are accessible on the GitHub website (
Deep learning has been increasingly popular in omics data analysis. Recent works incorporating variable selection into deep learning have greatly enhanced the model’s interpretability. However, because deep learning desires a large sample size, the existing methods may result in uncertain findings when the dataset has a small sample size, commonly seen in omics data analysis. With the explosion and availability of omics data from multiple populations/studies, the existing methods naively pool them into one dataset to enhance the sample size while ignoring that variable structures can differ across datasets, which might lead to inaccurate variable selection results. We propose a penalized integrative deep neural network (PIN) to simultaneously select important variables from multiple datasets. PIN directly aggregates multiple datasets as input and considers both homogeneity and heterogeneity situations among multiple datasets in an integrative analysis framework. Results from extensive simulation studies and applications of PIN to gene expression datasets from elders with different cognitive statuses or ovarian cancer patients at different stages demonstrate that PIN outperforms existing methods with considerably improved performance among multiple datasets. The source code is freely available on Github (rucliyang/PINFunc). We speculate that the proposed PIN method will promote the identification of disease‐related important variables based on multiple studies/datasets from diverse origins.
Identifying drug–drug interactions (DDIs) is an important aspect of drug design research, and predicting DDIs serves as a crucial guarantee for avoiding potential adverse effects. Current substructure‐based prediction methods still have some limitations: (ⅰ) The process of substructure extraction does not fully exploit the graph structure information of drugs, as it only evaluates the importance of different radius substructures from a single perspective. (ⅱ) The process of constructing drug representations has overlooked the significant impact of relation embedding on optimizing drug representations. In this work, we propose a substructure‐aware graph neural network incorporating relation features (RFSA‐DDI) for DDI prediction, which introduces a directed message passing neural network with substructure attention mechanism based on graph self‐adaptive pooling (GSP‐DMPNN) and a substructure‐aware interaction module incorporating relation features (RSAM). GSP‐DMPNN utilizes graph self‐adaptive pooling to comprehensively consider node features and local drug information for adaptive extraction of substructures. RSAM interacts drug features with relation representations to enhance their respective features individually, highlighting substructures that significantly impact predictions. RFSA‐DDI is evaluated on two real‐world datasets. Compared to existing methods, RFSA‐DDI demonstrates certain advantages in both transductive and inductive settings, effectively handling the task of predicting DDIs for unseen drugs and exhibiting good generalization capability. The experimental results show that RFSA‐DDI can effectively capture valuable structural information of drugs more accurately for DDI prediction, and provide more reliable assistance for potential DDIs detection in drug development and treatment stages.
Epistasis is a ubiquitous phenomenon in genetics and is considered to be one of the main factors in current efforts to unveil the missing heritability of complex diseases. Simulation data are crucial for evaluating epistasis detection tools in genome-wide association studies (GWAS). Existing simulators normally suffer from two limitations: the absence of support for high-order epistasis models containing multiple single nucleotide polymorphisms (SNPs), and the inability to generate simulation SNP data independently. In this study, we propose a simulator, SimHOEPI, which is capable of calculating penetrance tables of high-order epistasis models depending on either prevalence or heritability, and which uses a resampling strategy to generate simulation data independently. Highlights of SimHOEPI are the preservation of realistic minor allele frequencies in sampling data, the accurate calculation and embedding of high-order epistasis models, and acceptable simulation time. A series of experiments were carried out to verify these properties from different aspects. Experimental results show that SimHOEPI can generate simulation SNP data independently with high-order epistasis models, implying that it might be an alternative simulator for GWAS.
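To make the link between a penetrance table, prevalence, and heritability concrete, the sketch below computes both quantities for an epistasis model of any order. It rests on assumptions that the abstract does not state: SNPs are independent and in Hardy-Weinberg equilibrium, prevalence is the frequency-weighted mean penetrance, and heritability is the frequency-weighted variance of penetrance divided by P(1 - P). The function names are hypothetical and this is not SimHOEPI's implementation.

```python
from itertools import product


def genotype_freqs(maf):
    """Hardy-Weinberg frequencies for 0, 1 and 2 copies of the minor allele."""
    return [(1 - maf) ** 2, 2 * maf * (1 - maf), maf ** 2]


def prevalence_and_heritability(penetrance, mafs):
    """Prevalence and heritability implied by a penetrance table.

    `penetrance` maps a genotype tuple (one value in {0, 1, 2} per SNP)
    to P(disease | genotype). Assumes independent SNPs in Hardy-Weinberg
    equilibrium. Illustrative only.
    """
    per_snp = [genotype_freqs(m) for m in mafs]
    weighted = []
    prevalence = 0.0
    for combo in product(range(3), repeat=len(mafs)):
        # Joint genotype frequency is the product of per-SNP frequencies.
        freq = 1.0
        for snp, genotype in enumerate(combo):
            freq *= per_snp[snp][genotype]
        weighted.append((freq, penetrance[combo]))
        prevalence += freq * penetrance[combo]

    # Variance of penetrance across genotypes, scaled by the Bernoulli variance.
    variance = sum(f * (p - prevalence) ** 2 for f, p in weighted)
    heritability = variance / (prevalence * (1 - prevalence))
    return prevalence, heritability


if __name__ == "__main__":
    # Toy second-order model: risk is raised only when both SNPs carry
    # at least one minor allele (made-up penetrance values).
    table = {
        (a, b): (0.4 if a > 0 and b > 0 else 0.05)
        for a, b in product(range(3), repeat=2)
    }
    print(prevalence_and_heritability(table, mafs=[0.3, 0.2]))
```

A simulator that targets a given prevalence or heritability would solve these relations in the opposite direction, adjusting the penetrance values until the computed quantity matches the requested one.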
Mutational signatures refer to distinct patterns of DNA mutations that occur in a specific context or under certain conditions. They are a powerful tool to describe cancer etiology. We conducted a study to show cancer heterogeneity and cancer specificity from the aspect of mutational signatures through collinearity analysis and machine learning techniques. Through thorough training and independent validation, our results show that while the majority of the mutational signatures are distinct, similarities between certain mutational signature pairs can be observed through both mutation patterns and mutational signature abundance. This observation can potentially help determine the etiology of as yet elusive mutational signatures. Further analysis using machine learning approaches demonstrated moderate mutational signature cancer specificity. Among all cancer types, skin cancer demonstrated the strongest mutational signature specificity.
The year 2023 marked a significant surge in the exploration of applying large language model chatbots, notably Chat Generative Pre‐trained Transformer (ChatGPT), across various disciplines. We surveyed the application of ChatGPT in bioinformatics and biomedical informatics throughout the year, covering omics, genetics, biomedical text mining, drug discovery, biomedical image understanding, bioinformatics programming, and bioinformatics education. Our survey delineates the current strengths and limitations of this chatbot in bioinformatics and offers insights into potential avenues for future developments.
Epithelial cell networks imply a packing geometry characterized by various cell shapes and by distributions of the number of cell neighbors and of cell areas. Although cell sheets can be described by such simple characteristics, the formation of bubble‐like cells during the morphogenesis of epithelial tissues remains poorly understood. This study proposes a topological mathematical model of morphogenesis in a squamous epithelium. We introduce a new potential that takes into account not only the elasticity of the cell perimeter and area but also the elasticity of the internal angles of each cell. Additionally, we incorporate an integral equation for chemical signaling, which allows us to consider chemo‐mechanical cell interactions. Beyond these factors, the model takes into account essential processes in real epithelia, such as cell proliferation and intercalation. The model has yielded novel insights into the packing of epithelial sheets. We found two main states: one consists of cells of the same size, and the other consists of “bubble” cells. We also provide an example of how chemo‐mechanical interactions can be accounted for in a multicellular environment. Introducing a parameter that determines the flexibility of cell shapes enables the modeling of more complex cell behaviors, such as changes in cell phenotype. The developed model of squamous epithelium morphogenesis advances the understanding of how cell networks form. The modeling results are important for understanding the mechanisms of morphogenesis and development of epithelial tissues. Additionally, the results can be applied in developing methods to influence morphogenetic processes in medical applications.
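The abstract does not give the functional form of the potential. As an illustration only, the sketch below assumes a simple quadratic penalty on deviations of a cell's area, perimeter, and internal angles from target values, in the spirit of standard vertex models. The function names, the quadratic form, and the stiffness parameters `k_area`, `k_perim`, and `k_angle` are assumptions and are not taken from the paper.

```python
import numpy as np


def polygon_area(v):
    """Shoelace formula; v holds vertices in counter-clockwise order, shape (n, 2)."""
    x, y = v[:, 0], v[:, 1]
    return 0.5 * np.sum(x * np.roll(y, -1) - np.roll(x, -1) * y)


def polygon_perimeter(v):
    """Sum of edge lengths of the closed polygon."""
    return np.sum(np.linalg.norm(np.roll(v, -1, axis=0) - v, axis=1))


def interior_angles(v):
    """Interior angle at each vertex of a counter-clockwise polygon, in radians."""
    to_next = np.roll(v, -1, axis=0) - v
    to_prev = np.roll(v, 1, axis=0) - v
    cross = to_next[:, 0] * to_prev[:, 1] - to_next[:, 1] * to_prev[:, 0]
    dot = np.sum(to_next * to_prev, axis=1)
    # Counter-clockwise angle from the outgoing edge to the incoming edge.
    return np.mod(np.arctan2(cross, dot), 2 * np.pi)


def cell_energy(v, area0, perim0, angle0, k_area=1.0, k_perim=1.0, k_angle=1.0):
    """Hypothetical quadratic potential: area, perimeter and angle elasticity."""
    e_area = 0.5 * k_area * (polygon_area(v) - area0) ** 2
    e_perim = 0.5 * k_perim * (polygon_perimeter(v) - perim0) ** 2
    e_angle = 0.5 * k_angle * np.sum((interior_angles(v) - angle0) ** 2)
    return float(e_area + e_perim + e_angle)


# A regular hexagon with unit side length sits at the minimum of this potential
# when the targets are its own area, perimeter and 120-degree angles.
theta = np.arange(6) * np.pi / 3
hexagon = np.column_stack([np.cos(theta), np.sin(theta)])
targets = dict(area0=1.5 * np.sqrt(3), perim0=6.0, angle0=2 * np.pi / 3)
print(cell_energy(hexagon, **targets))
print(cell_energy(hexagon * np.array([2.0, 1.0]), **targets))
```

In this form, `k_angle` plays the role of a shape-flexibility parameter: a small value lets cells deviate from regular polygons at little energetic cost, while a large value penalizes such deviations.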
A gene regulatory network (GRN) is the complex network formed by regulatory interactions between genes in living cells. In this paper, we consider inferring GRNs in single cells from single‐cell RNA sequencing (scRNA‐seq) data. In scRNA‐seq, single cells are often profiled from mixed populations, and their cell identities are unknown. A common practice for single‐cell GRN analysis is to first cluster the cells and then infer a GRN for each cluster separately. However, this two‐step procedure ignores uncertainty in the clustering step and can therefore lead to inaccurate estimation of the networks. Here, we consider the mixture Poisson log‐normal (MPLN) model for network inference from count data of mixed populations. The precision matrices of the MPLN are the GRNs of the different cell types. To avoid the intractable optimization of the MPLN’s log‐likelihood, we develop an algorithm called variational mixture Poisson log‐normal (VMPLN), which jointly estimates the GRNs of different cell types using variational inference. We compare VMPLN with state‐of‐the‐art single‐cell regulatory network inference methods. Comprehensive simulations show that VMPLN achieves better performance, especially in scenarios where different cell types are highly mixed. Benchmarking on real scRNA‐seq data also demonstrates that VMPLN provides more accurate network estimation in most cases. Finally, we apply VMPLN to a large scRNA‐seq dataset from patients infected with severe acute respiratory syndrome coronavirus 2 (SARS‐CoV‐2) and find that VMPLN identifies critical differences in the regulatory networks of immune cells between patients with moderate and severe symptoms. The source code is available on GitHub (github.com/XiDsLab/SCVMPLN).
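The generative side of a mixture Poisson log‐normal model can be sketched as follows. This is a minimal illustration of the model class, not the VMPLN code from the repository above. Each cell draws a cell type, then a latent Gaussian vector whose precision matrix encodes that type's network, then Poisson counts with log‐rate equal to the latent vector. The function name `sample_mpln` and the scalar `library_size` scaling are assumptions made for this example.

```python
import numpy as np


def sample_mpln(n_cells, weights, means, precisions, library_size=1.0, seed=0):
    """Draw counts from a mixture Poisson log-normal model (illustrative sketch).

    weights:    mixing proportions of the cell types, summing to 1
    means:      per-type mean vectors of the latent log-expression
    precisions: per-type symmetric positive-definite precision matrices;
                non-zero off-diagonal entries are the network edges
    """
    rng = np.random.default_rng(seed)
    weights = np.asarray(weights, dtype=float)
    n_genes = len(means[0])
    # Latent cell-type label for every cell.
    labels = rng.choice(len(weights), size=n_cells, p=weights)
    counts = np.zeros((n_cells, n_genes), dtype=np.int64)
    for k in range(len(weights)):
        idx = np.flatnonzero(labels == k)
        if idx.size == 0:
            continue
        # Latent Gaussian layer: covariance is the inverse of the precision.
        cov = np.linalg.inv(np.asarray(precisions[k], dtype=float))
        latent = rng.multivariate_normal(means[k], cov, size=idx.size)
        # Observation layer: Poisson counts with log-normal rates.
        counts[idx] = rng.poisson(library_size * np.exp(latent))
    return counts, labels


# Two cell types over three genes; the types differ in which gene pair is linked.
prec_a = np.array([[1.0, 0.6, 0.0], [0.6, 1.0, 0.0], [0.0, 0.0, 1.0]])
prec_b = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, -0.6], [0.0, -0.6, 1.0]])
counts, labels = sample_mpln(
    500, [0.5, 0.5], [np.zeros(3), np.ones(3)], [prec_a, prec_b], library_size=5.0
)
print(counts.shape, np.bincount(labels, minlength=2))
```

Inference runs in the opposite direction: given only `counts`, the labels and the precision matrices must be estimated jointly, which is the step the abstract addresses with variational inference.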