Background: The presence of doublets in single-cell RNA sequencing (scRNA-seq) data poses a major challenge for downstream analysis. Computational doublet-detection methods have been developed to remove doublets from scRNA-seq data. However, the default hyperparameter settings of these methods may not provide optimal performance.
Methods: We propose a strategy to tune hyperparameters for a cutting-edge doublet-detection method. We utilize a full factorial design to explore the relationship between hyperparameters and detection accuracy on 16 real scRNA-seq datasets. The optimal hyperparameters are obtained by a response surface model and convex optimization.
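The tuning pipeline described above can be sketched in miniature. In this hedged example, a quadratic response surface is fitted by least squares to simulated detection accuracies from a full factorial design over two hypothetical hyperparameters, and a dense grid maximization stands in for the convex-optimization step (the hyperparameter names, ranges, and accuracy function are all illustrative assumptions, not the paper's actual setup):

```python
import numpy as np

def quad_design(X):
    """Design matrix with intercept, linear, interaction, and quadratic terms."""
    x1, x2 = X[:, 0], X[:, 1]
    return np.column_stack([np.ones_like(x1), x1, x2, x1 * x2, x1**2, x2**2])

# Toy full factorial design over two hypothetical hyperparameters in [0, 1]
levels = np.linspace(0.0, 1.0, 5)
X = np.array([(a, b) for a in levels for b in levels])
# Simulated detection accuracy peaking at hyperparameters (0.6, 0.4)
y = 0.9 - (X[:, 0] - 0.6) ** 2 - (X[:, 1] - 0.4) ** 2

# Fit the quadratic response surface by least squares
coef, *_ = np.linalg.lstsq(quad_design(X), y, rcond=None)

# Dense grid maximization stands in for the convex-optimization step
grid = np.array([(a, b) for a in np.linspace(0, 1, 101)
                 for b in np.linspace(0, 1, 101)])
best = grid[np.argmax(quad_design(grid) @ coef)]
print(best)
```

Because the simulated accuracy is itself quadratic, the fitted surface recovers it and the maximizer lands near (0.6, 0.4); with real data the surface is only an approximation and a proper convex solver would replace the grid step.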
Results: We show that the optimal hyperparameters provide top performance across scRNA-seq datasets under various biological conditions. Our tuning strategy can be applied to other computational doublet-detection methods. It also offers insights into hyperparameter tuning for broader computational methods in scRNA-seq data analysis.
Conclusions: The hyperparameter configuration significantly impacts the performance of computational doublet-detection methods. Our study is the first attempt to systematically explore the optimal hyperparameters under various biological conditions and optimization objectives. Our study provides much-needed guidance for hyperparameter tuning in computational doublet-detection methods.
Background: Morphogenesis is a complex process that unfolds at the organ, cellular, and molecular levels of a developing animal. In this investigation, allometry was evaluated at the cellular level.
Methods: Geometric information, including the time-lapse Cartesian coordinates of each cell's center, was used to calculate the allometric coefficients. A zero-centroaxial skew-symmetric matrix (CSSM) was generated and used to construct another square matrix (basic square matrix: BSM), and the determinant of the BSM was calculated (d). The logarithm of the absolute value of d (Lad) was plotted for all of the cells across a range of developmental stages, and the slope of the regression line was estimated and used as the allometric coefficient. Moreover, the lineage growth rate (LGR) was calculated by plotting Lad against the logarithm of time. The complexity index at each stage was also calculated. The method was tested on a developing Caenorhabditis elegans embryo.
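The regression step described above can be sketched as follows; the CSSM/BSM construction itself is omitted, and the stage times and determinant values are hypothetical numbers chosen only to make the slope easy to verify:

```python
import numpy as np

def lineage_growth_rate(times, d_values):
    """Slope of Lad = log10|d| against log10(time), i.e. the LGR.
    d_values are the BSM determinants d at each developmental stage."""
    lad = np.log10(np.abs(d_values))
    slope, _intercept = np.polyfit(np.log10(times), lad, 1)
    return slope

# Hypothetical stage times and BSM determinants for one cell lineage
times = np.array([10.0, 20.0, 40.0, 80.0])
d_vals = np.array([1e2, 1e3, 1e4, 1e5])  # |d| grows tenfold as time doubles
lgr = lineage_growth_rate(times, d_vals)
print(lgr)
```

The allometric coefficient is obtained the same way, with Lad regressed against the chosen reference variable instead of log time.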
Results: We explored two of the first four blastomeres generated in the C. elegans embryo. Analysis of the ABp and EMS lineages showed that the allometric coefficient of ABp was higher than that of EMS, consistent with both the complexity index and the LGR.
Conclusion: The complexity of differentiating cells in a developing embryo can be evaluated by allometric scaling based on data derived from the Cartesian coordinates of the cells at different stages of development.
The impressive conversational and programming abilities of ChatGPT make it an attractive tool for teaching bioinformatics data analysis to beginners. In this study, we proposed an iterative model for fine-tuning the instructions that guide a chatbot in generating code for bioinformatics data analysis tasks. We demonstrated the feasibility of the model by applying it to various bioinformatics topics. Additionally, we discussed practical considerations and limitations regarding the use of the model in chatbot-aided bioinformatics education.
Background: Molecular docking-based virtual screening (VS) aims to select ligands with potential pharmacological activities from millions or even billions of molecules. This process can significantly cut down the number of compounds that need to be experimentally tested. However, during the docking calculation, many molecules have low affinity for a particular protein target, which wastes considerable computational resources.
Methods: We implemented a fast and practical molecular screening approach called DL-DockVS (deep learning dock virtual screening) by using deep learning models (regression and classification models) to learn the outcomes of pipelined docking programs step-by-step.
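The stepwise filtering idea can be sketched in miniature. In this hedged example, simple linear models stand in for the deep classification and regression models, and the fingerprints and docking scores are synthetic; the point is only the pipeline shape: a cheap first stage weeds out predicted poor binders, and a second stage ranks the survivors:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic molecular fingerprints and docking scores (lower = better binder)
X = rng.normal(size=(1000, 16))
true_w = rng.normal(size=16)
scores = X @ true_w + rng.normal(scale=0.1, size=1000)

# Stage 1: a linear "classifier" (stand-in for the deep model) trained on
# the docking outcomes flags the worse-scoring half for removal
w1, *_ = np.linalg.lstsq(X, scores, rcond=None)
keep = (X @ w1) < np.median(X @ w1)

# Stage 2: a linear "regressor" refines the ranking of the survivors only,
# so no further effort is spent on predicted poor binders
w2, *_ = np.linalg.lstsq(X[keep], scores[keep], rcond=None)
ranked = np.flatnonzero(keep)[np.argsort(X[keep] @ w2)]  # best first

# Fraction of the true top-100 binders surviving stage 1
recall = np.mean(keep[np.argsort(scores)[:100]])
print(recall)
```

With accurate stage-1 predictions, nearly all of the best binders survive the cut, which is the recall-versus-speed trade-off the method exploits.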
Results: In this study, we showed that this approach could successfully weed out compounds with poor docking scores while keeping compounds with potentially high docking scores against 10 DUD-E protein targets. A self-built dataset of about 1.9 million molecules was used to further verify DL-DockVS, yielding good results in terms of recall rate, active-compound enrichment factor, and runtime speed.
Conclusions: We comprehensively evaluated the practicality and effectiveness of DL-DockVS against 10 protein targets. Owing to its improved runtime and maintained success rate, it is a useful and promising approach for screening ultra-large compound libraries in the age of big data. It also allows researchers to train a model for one specific target and then reuse it to predict high docking-score molecules in other chemical libraries without repeating the docking computation.
Background: Computational approaches for the accurate prediction of drug interactions, such as drug-drug interactions (DDIs) and drug-target interactions (DTIs), are in high demand among biochemical researchers. Although many methods have been proposed to predict DDIs and DTIs, respectively, their success is still limited by a lack of systematic evaluation of the intrinsic properties embedded in the corresponding chemical structures.
Methods: In this paper, we develop DeepDrug, a deep learning framework for overcoming the above limitation by using residual graph convolutional networks (Res-GCNs) and convolutional networks (CNNs) to learn the comprehensive structure- and sequence-based representations of drugs and proteins.
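The residual graph-convolution building block can be illustrated on a toy molecular graph. This is a generic Res-GCN layer sketch with made-up adjacency, features, and weights, not DeepDrug's exact architecture:

```python
import numpy as np

def res_gcn_layer(A, H, W):
    """One residual graph-convolution layer: propagate node features H
    over the row-normalized adjacency A, apply a ReLU, and add the
    input back as a residual connection (illustrative sketch only)."""
    A_norm = A / np.maximum(A.sum(axis=1, keepdims=True), 1e-12)
    return np.maximum(A_norm @ H @ W, 0.0) + H

# Toy molecular graph: 3 atoms in a chain (self-loops included), 4-dim features
A = np.array([[1.0, 1.0, 0.0],
              [1.0, 1.0, 1.0],
              [0.0, 1.0, 1.0]])
H = np.eye(3, 4)   # one-hot atom features
W = np.eye(4)      # identity weights keep the example easy to check
out = res_gcn_layer(A, H, W)
print(out)
```

The residual term is what lets many such layers be stacked without the node features degenerating, which is why residual GCNs can learn deeper structural representations than plain GCNs.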
Results: DeepDrug outperforms state-of-the-art methods in a series of systematic experiments, including binary-class DDI, multi-class/multi-label DDI, binary-class DTI classification, and DTI regression tasks. Furthermore, we visualize the structural features learned by the DeepDrug Res-GCN module, which display consistent patterns across chemical properties and drug categories, providing additional evidence for the strong predictive power of DeepDrug. Finally, we apply DeepDrug to perform drug repositioning on the whole DrugBank database to discover potential drug candidates against SARS-CoV-2; 7 of the 10 top-ranked drugs have been reported as repurposing candidates for treating coronavirus disease 2019 (COVID-19).
Conclusions: In summary, we believe that DeepDrug is an efficient tool for the accurate prediction of DDIs and DTIs and provides promising insights into the underlying mechanisms of these biochemical relations.
Background: The precise and efficient analysis of single-cell transcriptome data provides powerful support for studying the diversity of cell functions at the single-cell level. The most important and challenging steps are cell clustering and the recognition of cell populations. While the precision of clustering and annotation is considered separately in most current studies, it is worth attempting to develop an extensive and flexible strategy that comprehensively balances clustering accuracy and biological interpretability.
Methods: The cell marker-based clustering strategy (cmCluster), a modified Louvain clustering method, searches for the optimal clusters through a genetic algorithm (GA) and grid search based on the cell type annotation results.
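The GA-plus-grid-search idea can be sketched with a toy surrogate objective. In cmCluster the objective would score the Louvain clustering obtained at a given resolution against marker-based cell-type annotations; here a simple quadratic stands in for that score, and the resolution range, population size, and mutation scale are all illustrative assumptions:

```python
import numpy as np

def annotation_score(resolution):
    """Toy surrogate for the marker-based objective; a real run would
    cluster the cells at this resolution and score the annotation.
    This stand-in simply peaks at resolution 1.2."""
    return -(resolution - 1.2) ** 2

rng = np.random.default_rng(1)

# Grid search seeds the population with the five best resolutions
grid = np.linspace(0.1, 2.0, 20)
pop = grid[np.argsort([annotation_score(r) for r in grid])[-5:]]

# A tiny genetic algorithm then refines the candidates: mutate, then
# keep the five fittest of parents and children
for _ in range(30):
    children = pop + rng.normal(scale=0.05, size=pop.size)
    both = np.concatenate([pop, children])
    pop = both[np.argsort([annotation_score(r) for r in both])[-5:]]

best_resolution = pop[-1]  # highest-scoring candidate
print(best_resolution)
```

The grid stage gives cheap global coverage while the GA refines locally, which is the usual reason for combining the two.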
Results: By applying cmCluster to a set of single-cell transcriptome data, we found that it aided the recognition of cell populations and the explanation of biological function, even with incomplete cell type information or multiple data sources. In addition, cmCluster produced clear boundaries and appropriate subtypes with potential marker genes. The relevant code is available on GitHub (huangyuwei301/cmCluster).
Conclusions: We speculate that cmCluster provides researchers with effective screening strategies that improve the accuracy of subsequent biological analyses, reduce artificial bias, and facilitate the comparison and analysis of multiple studies.
Background: Interactions between distal enhancers and proximal promoters play a crucial role in the cis-regulatory mechanism of the human genome. Enhancers, promoters, and enhancer-promoter interactions (EPIs) can be detected using many sequencing technologies and computational models. However, a systematic review that summarizes these EPI identification methods and helps researchers apply and optimize them is still needed.
Results: In this review, we first emphasize the role of EPIs in regulating gene expression and describe a generic framework for predicting enhancer-promoter interaction. Next, we review prediction methods for enhancers, promoters, loops, and enhancer-promoter interactions using different data features that have emerged since 2010, and we summarize the websites available for obtaining enhancers, promoters, and enhancer-promoter interaction datasets. Finally, we review the application of the methods for identifying EPIs in diseases such as cancer.
Conclusions: The advance of computer technology has allowed traditional machine learning and deep learning methods to be used to predict enhancers, promoters, and EPIs from genetic, genomic, and epigenomic features. In the past decade, models based on deep learning, especially transfer learning, have been proposed for directly predicting enhancer-promoter interactions from DNA sequences, and these models can reduce the parameter-training time required of bioinformatics researchers. We believe this review can provide detailed research frameworks for researchers who are beginning to study enhancers, promoters, and their interactions.
Background: Synthetic microbial communities, with different strains brought together by balancing their nutrition and promoting their interactions, demonstrate great advantages for exploring the complex performance of communities and for further biotechnology applications. The potential of such microbial communities has not been fully explored, owing to our limited knowledge of the extremely complex microbial interactions involved in designing and controlling effective and stable communities.
Results: Genome-scale metabolic models (GEMs) have been demonstrated to be an effective tool for predicting and guiding the investigation and design of microbial communities, since they can explicitly and efficiently predict the phenotype of organisms from their genotypic data and can be used to explore the molecular mechanisms of microbe-habitat and microbe-microbe interactions. In this work, we reviewed two main categories of GEM-based approaches and three uses related to the design of synthetic microbial communities: predicting multi-species interactions, exploring environmental impacts on microbial phenotypes, and optimizing community-level performance.
Conclusions: Although still in their infancy, GEM-based approaches exhibit an increasing scope of applications in designing synthetic microbial communities. Compared with other methods, especially the use of laboratory cultures, GEM-based approaches can greatly decrease the trial-and-error cost of various procedures for designing synthetic communities and improving their functionality, such as identifying community members, determining media composition, evaluating microbial interaction potential, or selecting the best community configuration. Future efforts should be made to overcome the limitations of these approaches, ranging from quality control of GEM reconstructions to community-level modeling algorithms, so that more applications of GEMs in studying the phenotypes of microbial communities can be expected.
Background: Single-cell multi-omics technologies allow a profound systems-level understanding of cells and tissues. However, an integrative and possibly systems-based analysis capturing the different modalities is challenging. In response, bioinformatics and machine learning methodologies are being developed for multi-omics single-cell analysis. It is unclear whether current tools can address the dual aspects of modality integration and cross-modality prediction without requiring extensive parameter fine-tuning.
Methods: We designed LIBRA, a neural-network-based framework, to learn the translation between paired multi-omics profiles so that a shared latent space is constructed. Additionally, we implemented a variant, aLIBRA, that allows automatic fine-tuning by identifying parameter combinations that optimize both the integrative and predictive tasks. All model parameters and evaluation metrics are made available to users with minimal user interaction. Furthermore, aLIBRA allows experienced users to implement custom configurations. The LIBRA toolbox is freely available as R and Python libraries on GitHub (TranslationalBioinformaticsUnit/LIBRA).
Results: LIBRA was evaluated on eight multi-omic single-cell datasets, including three combinations of omics. We observed that LIBRA is a state-of-the-art tool when evaluating the ability to increase cell-type (clustering) resolution in the integrated latent space. Furthermore, when assessing the predictive power across data modalities, such as predicting chromatin accessibility from gene expression, LIBRA outperforms existing tools. As expected, adaptive parameter optimization (aLIBRA) significantly boosted the performance of learning predictive models from paired datasets.
Conclusion: LIBRA is a versatile tool that performs competitively in both “integration” and “prediction” tasks based on single-cell multi-omics data. LIBRA is a data-driven robust platform that includes an adaptive learning scheme.
Background: The hierarchical three-dimensional (3D) architecture of chromatin plays an important role in fundamental biological processes, such as cell differentiation, cellular senescence, and transcriptional regulation. Aberrant 3D chromatin structural alterations are often present in human diseases, including cancers, but their underlying mechanisms remain unclear.
Results: 3D chromatin structures (chromatin compartments A/B, topologically associated domains, and enhancer-promoter interactions) play key roles in cancer development, metastasis, and drug resistance. Bioinformatics techniques based on machine learning and deep learning have shown great potential in the study of the 3D cancer genome.
Conclusion: Current advances in the study of the 3D cancer genome have expanded our understanding of the mechanisms underlying tumorigenesis and development. These advances will provide new insights into precise diagnosis and personalized treatment of cancers.
Background: With the development of rapid and cheap sequencing techniques, the cost of whole-genome sequencing (WGS) has dropped significantly. However, the complexity of the human genome is not limited to the pure sequence, and additional experiments are required to learn the human genome's influence on complex traits. One of the most exciting current topics is the spatial organisation of the genome, which can be discovered using spatial experiments (e.g., Hi-C, ChIA-PET). Information about spatial contacts aids analysis and brings new insights into our understanding of disease development.
Methods: We used an ensemble of deep learning and classical machine learning algorithms. The deep learning network was DNABERT, which applies the BERT language model (based on transformers) to genomic sequences. The classical machine learning models included support vector machines (SVMs), random forests (RFs), and K-nearest neighbors (KNN). The whole approach was wrapped together as deep hybrid learning (DHL).
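The DHL ensemble can be sketched as a simple majority vote that combines the deep model's positive-class probability with the classical models' decisions. Every model below is a hypothetical stand-in (not DNABERT or trained SVM/RF/KNN), and the embedding is a made-up vector:

```python
import numpy as np

def dhl_predict(embedding, deep_prob, classical_models, threshold=0.5):
    """Deep hybrid learning (DHL) sketch: the deep model's probability
    is thresholded into one vote, the classical models each cast a
    vote on the embedding, and the majority decides."""
    votes = [deep_prob >= threshold]
    votes += [clf(embedding) for clf in classical_models]
    return int(sum(votes) > len(votes) / 2)

# Toy stand-ins for SVM, RF, and KNN decisions on a DNABERT-style embedding
svm = lambda e: e.mean() > 0
rf = lambda e: e.max() > 1.0
knn = lambda e: e[0] > 0

emb = np.array([0.5, -0.2, 1.3])
pred = dhl_predict(emb, deep_prob=0.7, classical_models=[svm, rf, knn])
print(pred)  # -> 1: all four votes are positive
```

In practice the classical models would be trained on the deep network's embeddings, so the ensemble can correct cases where the deep model alone is over- or under-confident.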
Results: We found that DNABERT can be used to predict the outcomes of ChIA-PET experiments with high precision. Additionally, the DHL approach improved the metrics on the CTCF and RNAPII sets.
Conclusions: The DHL approach should be considered for models that utilise the power of deep learning. While conceptually straightforward, it can improve results significantly.
Background: Machine learning has enabled the automatic detection of facial expressions, which is particularly beneficial in smart monitoring and in understanding the mental state of medical and psychological patients. Most algorithms that attain high emotion-classification accuracy require extensive computational resources, which either demand bulky, inefficient devices or require the sensor data to be processed on cloud servers. However, there is always a risk of privacy invasion, data misuse, and data manipulation when raw images are transferred to cloud servers for processing facial emotion recognition (FER) data. One possible solution to this problem is to minimize the movement of such private data.
Methods: In this research, we propose an efficient implementation of a convolutional neural network (CNN) based algorithm for on-device FER on a low-power field programmable gate array (FPGA) platform. This is done by encoding the CNN weights to approximated signed digits, which reduces the number of partial sums to be computed for multiply-accumulate (MAC) operations. This is advantageous for portable devices that lack full-fledged resource-intensive multipliers.
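The signed-digit encoding idea can be illustrated with a greedy approximation that limits each weight to a few signed powers of two; each power of two becomes one shift-and-add partial sum, so a MAC needs only that many additions instead of a full multiplier. This is an assumption-laden sketch of the general technique, not the paper's exact encoding:

```python
import math

def signed_digit_approx(w, n_terms=2):
    """Greedily approximate weight w as a sum of at most n_terms signed
    powers of two. Each term maps to one shift-and-add partial sum in
    hardware (illustrative sketch, not the paper's exact scheme)."""
    terms = []
    residual = w
    for _ in range(n_terms):
        if residual == 0:
            break
        exp = round(math.log2(abs(residual)))      # nearest power of two
        terms.append(math.copysign(2.0 ** exp, residual))
        residual = w - sum(terms)
    return terms

terms = signed_digit_approx(0.3)
approx = sum(terms)
print(terms, approx)  # [0.25, 0.0625] -> 0.3125, within ~4% of 0.3
```

Capping the number of nonzero signed digits per weight is what trades a small accuracy loss for a large reduction in FPGA multiplier resources.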
Results: We applied our approximation method to MobileNet-v2 and ResNet18 models pretrained on the FER2013 dataset. Our implementations and simulations reduce the FPGA resource requirement by at least 22% compared with integer-weight models, with negligible loss in classification accuracy.
Conclusions: The outcome of this research will help in the development of secure and low-power systems for FER and other biomedical applications. The approximation methods used in this research can also be extended to other image-based biomedical research fields.