Copy number variation (CNV) refers to variation in the number of copies of a specific sequence in a genome and is a type of chromatin structural variation. The development of the Hi‐C technique has empowered research on the spatial structure of chromatin by capturing interactions between DNA fragments. We utilized machine‐learning methods, including a linear transformation model and a graph convolutional network (GCN), to detect CNV events from Hi‐C data and to reveal how CNV is related to three‐dimensional interactions between genomic fragments in terms of the one‐dimensional read count signal and features of the chromatin structure. The experimental results demonstrated a chromosome‐specific linear relation between the Hi‐C read count and CNV that is well described by the linear transformation model. In addition, the GCN‐based model could accurately extract features of the spatial structure from Hi‐C data and infer the corresponding CNV across different chromosomes in a cancer cell line. We performed a series of experiments, including dimension reduction, transfer learning, and Hi‐C data perturbation, to comprehensively evaluate the utility and robustness of the GCN‐based model. This work provides a benchmark for using machine learning to infer CNV from Hi‐C data and serves as a necessary foundation for a deeper understanding of the relationship between Hi‐C data and CNV.
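The per‐chromosome linear relation between Hi‐C read count and CNV described above can be illustrated with a minimal sketch. It assumes an ordinary least‐squares fit of copy number against binned read count, done separately for each chromosome; the abstract does not specify the actual form of the linear transformation model, so this is not the authors' implementation. The function names (`fit_line`, `fit_per_chromosome`) and the toy values are hypothetical.

```python
from statistics import mean


def fit_line(x, y):
    """Ordinary least-squares fit of y ~ slope * x + intercept."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    if sxx == 0:
        raise ValueError("x must contain at least two distinct values")
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, my - slope * mx


def fit_per_chromosome(bins):
    """Fit one line per chromosome.

    bins maps a chromosome name to (read_counts, copy_numbers),
    two equal-length lists with one entry per genomic bin.
    """
    return {chrom: fit_line(reads, cnv) for chrom, (reads, cnv) in bins.items()}


# Toy, made-up values for illustration only (not data from the study).
toy = {
    "chr1": ([100.0, 200.0, 300.0], [1.0, 2.0, 3.0]),
    "chr2": ([100.0, 200.0, 300.0], [2.0, 3.0, 4.0]),
}
models = fit_per_chromosome(toy)
for chrom, (slope, intercept) in models.items():
    print(chrom, round(slope, 4), round(intercept, 4))
```

Fitting each chromosome separately lets the slope and intercept differ between chromosomes, which is one simple way to express a chromosome‐specific linear relation.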
Mutational signatures are distinct patterns of DNA mutations that occur in a specific context or under certain conditions. They are a powerful tool for describing cancer etiology. We conducted a study to characterize cancer heterogeneity and cancer specificity from the perspective of mutational signatures through collinearity analysis and machine learning techniques. After thorough training and independent validation, we found that while the majority of mutational signatures are distinct, similarities between certain mutational signature pairs can be observed in both mutation patterns and mutational signature abundance. This observation can potentially help determine the etiology of as‐yet elusive mutational signatures. Further analysis using machine learning approaches demonstrated moderate cancer specificity of mutational signatures. Among all cancer types, skin cancer demonstrated the strongest mutational signature specificity.
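Similarity between the mutation patterns of two signatures can be quantified in several ways. The sketch below uses cosine similarity between signature profiles, a common choice for this kind of comparison; the abstract does not state which measure the study used, so this is an assumption. The function name and the four‐element toy profiles are hypothetical (real profiles typically span many more mutation contexts).

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("profiles must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("profiles must not be all zeros")
    return dot / (norm_a * norm_b)


# Toy, made-up profiles: each entry is the fraction of mutations in one context.
sig_a = [0.5, 0.3, 0.2, 0.0]
sig_b = [0.4, 0.4, 0.2, 0.0]
similarity = cosine_similarity(sig_a, sig_b)
print(f"cosine similarity: {similarity:.3f}")
```

A value near 1 indicates nearly proportional profiles, while a value near 0 indicates profiles concentrated in different mutation contexts.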
Identifying drug–drug interactions (DDIs) is an important aspect of drug design research, and predicting DDIs is crucial for avoiding potential adverse effects. Current substructure‐based prediction methods still have some limitations: (ⅰ) The process of substructure extraction does not fully exploit the graph structure information of drugs, as it only evaluates the importance of substructures of different radii from a single perspective. (ⅱ) The process of constructing drug representations overlooks the significant impact of relation embedding on optimizing drug representations. In this work, we propose a substructure‐aware graph neural network incorporating relation features (RFSA‐DDI) for DDI prediction, which introduces a directed message passing neural network with a substructure attention mechanism based on graph self‐adaptive pooling (GSP‐DMPNN) and a substructure‐aware interaction module incorporating relation features (RSAM). GSP‐DMPNN utilizes graph self‐adaptive pooling to jointly consider node features and local drug information for adaptive extraction of substructures. RSAM lets drug features and relation representations interact so that each enhances the other, highlighting substructures that significantly impact predictions. RFSA‐DDI is evaluated on two real‐world datasets. Compared to existing methods, RFSA‐DDI demonstrates advantages in both transductive and inductive settings, effectively handling the task of predicting DDIs for unseen drugs and exhibiting good generalization capability. The experimental results show that RFSA‐DDI captures valuable structural information of drugs more accurately for DDI prediction and provides more reliable assistance for detecting potential DDIs in the drug development and treatment stages.
Predictive analytics is crucial in precision medicine for personalized patient care. To aid in precision medicine, this study identifies a subset of genetic and clinical variables that can serve as predictors for classifying diseased tissues/disease types. To achieve this, experiments were performed on diseased tissues obtained from the L1000 dataset to assess differences in the functionality and predictive capabilities of genetic and clinical variables. The k‐means technique was used for clustering the diseased tissue types, and the multinomial logistic regression (MLR) technique was applied for classifying them. Dimensionality reduction techniques including principal component analysis and Boruta were used extensively to reduce the dimensionality of genetic and clinical variables. The results showed that landmark genes performed slightly better in clustering diseased tissue types than any random set of 978 non‐landmark genes, and the difference was statistically significant. Furthermore, both clinical and genetic variables were important in predicting the diseased tissue types. The top three clinical predictors of diseased tissue types were morphology, gender, and age at diagnosis. Additionally, this study explored the possibility of using the latent representations of the clusters of landmark and non‐landmark genes as predictors for an MLR classifier. The classification models built using MLR revealed that landmark genes can serve as a subset of genetic variables and/or as a proxy for clinical variables. This study concludes that combining predictive analytics with dimensionality reduction effectively identifies key predictors in precision medicine, enhancing diagnostic accuracy.
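For readers unfamiliar with MLR, the classification step above can be illustrated with a minimal sketch of how a fitted multinomial logistic regression turns a feature vector into class probabilities: one linear score per class followed by a softmax. The weights, biases, and feature values below are made up for illustration and are not taken from the study; the function names are hypothetical, and model fitting is out of scope here.

```python
import math


def softmax(scores):
    """Convert raw class scores into probabilities that sum to 1."""
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def mlr_predict_proba(x, weights, biases):
    """Class probabilities for one sample under a multinomial logistic model.

    weights holds one weight vector per class; biases holds one bias per class.
    """
    scores = [
        sum(w_j * x_j for w_j, x_j in zip(w, x)) + b
        for w, b in zip(weights, biases)
    ]
    return softmax(scores)


# Hypothetical model with three tissue classes and two predictors.
weights = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
biases = [0.0, 0.0, 0.0]
probs = mlr_predict_proba([2.0, 0.5], weights, biases)
predicted_class = max(range(len(probs)), key=probs.__getitem__)
print(probs, predicted_class)
```

In practice the predictors could be reduced features (for example principal components) or selected genetic and clinical variables, with the weights estimated from labeled training data.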
Epithelial cell networks exhibit a packing geometry characterized by various cell shapes and by distributions of the number of cell neighbors and of cell areas. Although cell sheets can be described by such simple characteristics, the formation of bubble‐like cells during the morphogenesis of epithelial tissues remains poorly understood. This study proposes a topological mathematical model of morphogenesis in a squamous epithelium. We introduce a new potential that takes into account not only the elasticity of cell perimeter and area but also the elasticity of the cells' internal angles. Additionally, we incorporate an integral equation for chemical signaling, allowing us to consider chemo‐mechanical cell interactions. The model also takes into account essential processes in real epithelia, such as cell proliferation and intercalation. The model yields novel insights into the packing of epithelial sheets. We found two main states: one consists of cells of the same size, and the other consists of “bubble” cells. We also provide an example of how chemo‐mechanical interactions can be accounted for in a multicellular environment. The introduction of a parameter determining the flexibility of cell shapes enables the modeling of more complex cell behaviors, such as changes in cell phenotype. The developed model of squamous epithelium morphogenesis advances the understanding of how cell networks form and of the mechanisms of morphogenesis and development of epithelial tissues. Additionally, the results can be applied in developing methods to influence morphogenetic processes in medical applications.
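To make the idea of a potential with area, perimeter, and internal‐angle elasticity concrete, the sketch below evaluates a quadratic energy for a single polygonal cell: a penalty on the deviation of area, perimeter, and each interior angle from a target value. The quadratic form, the stiffness constants, and all function names are assumptions made for illustration; the abstract does not give the actual potential used in the model.

```python
import math


def polygon_area(vertices):
    """Area of a simple polygon via the shoelace formula."""
    n = len(vertices)
    twice_area = 0.0
    for i in range(n):
        x1, y1 = vertices[i]
        x2, y2 = vertices[(i + 1) % n]
        twice_area += x1 * y2 - x2 * y1
    return abs(twice_area) / 2.0


def polygon_perimeter(vertices):
    """Sum of edge lengths of a closed polygon."""
    n = len(vertices)
    return sum(math.dist(vertices[i], vertices[(i + 1) % n]) for i in range(n))


def interior_angles(vertices):
    """Interior angles (radians); vertices must be in counter-clockwise order."""
    n = len(vertices)
    angles = []
    for i in range(n):
        px, py = vertices[i - 1]
        cx, cy = vertices[i]
        nx, ny = vertices[(i + 1) % n]
        ux, uy = nx - cx, ny - cy  # edge towards the next vertex
        vx, vy = px - cx, py - cy  # edge towards the previous vertex
        cross = ux * vy - uy * vx
        dot = ux * vx + uy * vy
        angles.append(math.atan2(cross, dot) % (2 * math.pi))
    return angles


def cell_energy(vertices, area0, perimeter0, angle0,
                k_area=1.0, k_perimeter=1.0, k_angle=1.0):
    """Assumed quadratic elastic energy of one cell (illustrative form only)."""
    e_area = 0.5 * k_area * (polygon_area(vertices) - area0) ** 2
    e_perimeter = 0.5 * k_perimeter * (polygon_perimeter(vertices) - perimeter0) ** 2
    e_angle = 0.5 * k_angle * sum((a - angle0) ** 2 for a in interior_angles(vertices))
    return e_area + e_perimeter + e_angle


# A unit square matches its own targets; a 2 x 1 rectangle is penalized.
square = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
rectangle = [(0.0, 0.0), (2.0, 0.0), (2.0, 1.0), (0.0, 1.0)]
e_square = cell_energy(square, 1.0, 4.0, math.pi / 2)
e_rectangle = cell_energy(rectangle, 1.0, 4.0, math.pi / 2)
print(e_square, e_rectangle)
```

In this illustrative form, the angle stiffness `k_angle` plays the role of a shape‐flexibility parameter: lowering it lets cells deviate from regular shapes at little energetic cost.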
Monoclonal antibodies, which bind specifically to their targets through their complementarity‐determining regions (CDRs), are attractive therapeutic agents for a wide range of human disorders. Small proteins with structurally preserved CDRs are promising antibody mimetics. In this in silico study, we present new antibody mimetics against the cancer marker epidermal growth factor receptor (EGFR) created by the CDR grafting technique. Ten potential graft acceptor sites that efficiently immobilize the grafted CDR loops were selected computationally from three small protein scaffolds. The three CDR loops most involved in antibody‐receptor interactions, extracted from the crystal structure of the panitumumab antibody bound to EGFR domain III (DIII), were then grafted onto the selected scaffolds through the loop randomization technique. Combining the three CDR loops with the 10 grafting sites, binding energy calculations revealed that three of the 36 combinations showed specific binding to EGFR DIII. Thus, the present strategy and the selected small protein scaffolds are promising tools for designing new binders against EGFR with high binding energy.
Deep learning has become increasingly popular in omics data analysis. Recent works incorporating variable selection into deep learning have greatly enhanced model interpretability. However, because deep learning requires a large sample size, existing methods may yield uncertain findings when the dataset is small, as is common in omics data analysis. With the explosion and availability of omics data from multiple populations/studies, existing methods naively pool them into one dataset to increase the sample size while ignoring that variable structures can differ across datasets, which might lead to inaccurate variable selection results. We propose a penalized integrative deep neural network (PIN) to simultaneously select important variables from multiple datasets. PIN directly aggregates multiple datasets as input and considers both homogeneity and heterogeneity among the datasets in an integrative analysis framework. Results from extensive simulation studies and from applications of PIN to gene expression datasets from elderly individuals with different cognitive statuses and from ovarian cancer patients at different stages demonstrate that PIN considerably outperforms existing methods across multiple datasets. The source code is freely available on GitHub (rucliyang/PINFunc). We speculate that the proposed PIN method will promote the identification of important disease‐related variables based on multiple studies/datasets from diverse origins.
Caenorhabditis elegans has been widely used as a model organism in developmental biology due to its invariant development. In this study, we developed a desktop application, CShaperApp, to segment fluorescence‐labeled images of cell membranes and interactively analyze cellular morphologies during C. elegans embryogenesis. Based on the previously proposed framework CShaper, CShaperApp empowers biologists to automatically and efficiently extract quantitative cellular morphological data with either an existing deep learning model or a fine‐tuned one adapted to their in‐house dataset. Experimental results show that it takes about 30 min to process a three‐dimensional time‐lapse (4D) dataset, which consists of 150 image stacks at a ~1.5‐min interval and covers C. elegans embryogenesis from the 4‐cell to the 350‐cell stage. The robustness of CShaperApp is also validated with datasets from different laboratories. Furthermore, the modularized implementation increases flexibility in multi‐task applications and eases future enhancements. As cell morphology over development has emerged as a focus of interest in developmental biology, CShaperApp is anticipated to pave the way for such studies by accelerating the high‐throughput generation of systems‐level quantitative data. The software can be freely downloaded from GitHub (cao13jf/CShaperApp) and runs on Windows, macOS, and Linux operating systems.