Cover illustration
Large language model-based chatbots present novel possibilities and challenges for bioinformatics. The vivid imagery depicted on the cover portrays students engaged in an exciting conversation with a chatbot at sunrise, symbolizing the dawn of new prospects in bioinformatics. However, the mist enveloping the mountain valleys serves as a metaphor for the challenges that accompany these opportunities. Positioned alongside the students, an experienced educator devises strategies to ...
The impressive conversational and programming abilities of ChatGPT make it an attractive tool for teaching bioinformatics data analysis to beginners. In this study, we proposed an iterative model for fine-tuning the instructions that guide a chatbot in generating code for bioinformatics data analysis tasks. We demonstrated the feasibility of the model by applying it to various bioinformatics topics. Additionally, we discussed practical considerations and limitations regarding the use of the model in chatbot-aided bioinformatics education.
Background: The hierarchical three-dimensional (3D) architectures of chromatin play an important role in fundamental biological processes, such as cell differentiation, cellular senescence, and transcriptional regulation. Aberrant alterations in 3D chromatin structure are often present in human diseases, including cancer, but their underlying mechanisms remain unclear.
Results: 3D chromatin structures (chromatin compartment A/B, topologically associated domains, and enhancer-promoter interactions) play key roles in cancer development, metastasis, and drug resistance. Bioinformatics techniques based on machine learning and deep learning have shown great potential in the study of the 3D cancer genome.
Conclusion: Current advances in the study of the 3D cancer genome have expanded our understanding of the mechanisms underlying tumorigenesis and tumor development, and will provide new insights into precise diagnosis and personalized treatment of cancers.
Background: As part of the cis-regulatory mechanism of the human genome, interactions between distal enhancers and proximal promoters play a crucial role in gene regulation. Enhancers, promoters, and enhancer-promoter interactions (EPIs) can be detected using many sequencing technologies and computational models. However, a systematic review that summarizes these EPI identification methods and helps researchers apply and optimize them is still needed.
Results: In this review, we first emphasize the role of EPIs in regulating gene expression and describe a generic framework for predicting enhancer-promoter interaction. Next, we review prediction methods for enhancers, promoters, loops, and enhancer-promoter interactions using different data features that have emerged since 2010, and we summarize the websites available for obtaining enhancers, promoters, and enhancer-promoter interaction datasets. Finally, we review the application of the methods for identifying EPIs in diseases such as cancer.
Conclusions: The advance of computer technology has allowed traditional machine learning and deep learning methods to be used to predict enhancers, promoters, and EPIs from genetic, genomic, and epigenomic features. In the past decade, models based on deep learning, especially transfer learning, have been proposed for directly predicting enhancer-promoter interactions from DNA sequences, and these models can reduce the parameter training time required of bioinformatics researchers. We believe this review can provide detailed research frameworks for researchers who are beginning to study enhancers, promoters, and their interactions.
Background: Light-driven synthetic microbial consortia are composed of photoautotrophs and heterotrophs. They exhibit better stability, robustness, and capacity for handling complex tasks than axenic cultures. Unlike general microbial consortia, light-driven synthetic microbial consortia have an intrinsic property of photosynthetic oxygen evolution, which is an important factor affecting the functions of the consortia.
Results: In light-driven microbial consortia, the oxygen liberated by photoautotrophs results in an aerobic environment, which exerts dual effects on different species and processes. On one hand, oxygen is favorable to the synthetic microbial consortia when they are used for wastewater treatment and aerobic chemical production, in which biomass accumulation and oxidized product formation benefit from the high energy yield of aerobic respiration. On the other hand, oxygen is harmful to the synthetic microbial consortia when they are used for anaerobic processes, including biohydrogen production and bioelectricity generation, in which the presence of oxygen deactivates some biological components and competes for electrons.
Conclusions: Developing anaerobic processes using light-driven synthetic microbial consortia represents a cost-effective alternative for producing chemicals from carbon dioxide and light. Thus, a versatile approach to addressing the oxygen dilemma is essential to bring light-driven synthetic microbial consortia closer to practical application.
Background: With the development of rapid and cheap sequencing techniques, the cost of whole-genome sequencing (WGS) has dropped significantly. However, the complexity of the human genome is not limited to the sequence alone, and additional experiments are required to understand the genome's influence on complex traits. One of the most exciting aspects for scientists nowadays is the spatial organisation of the genome, which can be discovered using spatial experiments (e.g., Hi-C, ChIA-PET). Information about spatial contacts aids the analysis and brings new insights into our understanding of disease development.
Methods: We used an ensemble of deep learning and classical machine learning algorithms. The deep learning network we used was DNABERT, which applies the BERT language model (based on transformers) to genomic sequences. The classical machine learning models included support vector machines (SVMs), random forests (RFs), and k-nearest neighbors (KNN). The whole approach was wrapped together as deep hybrid learning (DHL).
Results: We found that DNABERT can predict the ChIA-PET interactions with high precision. Additionally, the DHL approach improved the metrics on the CTCF and RNAPII sets.
Conclusions: The DHL approach should be considered for models utilising the power of deep learning. While conceptually straightforward, it can improve results significantly.
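As a rough illustration of the deep hybrid learning idea, the sketch below feeds fixed-length sequence embeddings into a classical learner. Everything here is a stand-in: normalised k-mer frequencies replace DNABERT embeddings, a tiny KNN replaces the full SVM/RF/KNN ensemble, and the toy sequences and labels are invented for illustration.

```python
# Sketch of the DHL pattern: deep embeddings feeding a classical classifier.
# Assumptions: the embedding is a stand-in (k-mer frequencies, not DNABERT),
# and the labels (1 = interacting anchor, 0 = background) are hypothetical.
from collections import Counter
from itertools import product
import math

KMER = 3
VOCAB = ["".join(p) for p in product("ACGT", repeat=KMER)]

def embed(seq):
    """Stand-in for a DNABERT embedding: normalised k-mer frequency vector."""
    counts = Counter(seq[i:i + KMER] for i in range(len(seq) - KMER + 1))
    total = sum(counts.values()) or 1
    return [counts[k] / total for k in VOCAB]

def knn_predict(train, query, k=3):
    """Classical learner (KNN) applied on top of the deep embeddings."""
    dists = sorted((math.dist(vec, query), label) for vec, label in train)
    votes = Counter(label for _, label in dists[:k])
    return votes.most_common(1)[0][0]

# Toy labelled loci (hypothetical data).
train_seqs = [("ACGTACGTACGT", 1), ("ACGGACGGACGG", 1),
              ("TTTTAAAATTTT", 0), ("TTAATTAATTAA", 0)]
train = [(embed(s), y) for s, y in train_seqs]

print(knn_predict(train, embed("ACGTACGGACGT")))  # → 1
```

In the full DHL setting, several classical models would vote on the same embeddings; swapping the embedding function for real DNABERT output leaves the surrounding pipeline unchanged.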
Background: The precise and efficient analysis of single-cell transcriptome data provides powerful support for studying the diversity of cell functions at the single-cell level. The most important and challenging steps are cell clustering and recognition of cell populations. Although clustering precision and annotation are considered separately in most current studies, it is worth developing an extensible and flexible strategy that comprehensively balances clustering accuracy and biological interpretability.
Methods: The cell marker-based clustering strategy (cmCluster), a modified Louvain clustering method, searches for the optimal clusters through a genetic algorithm (GA) and grid search based on the cell type annotation results.
Results: Applying cmCluster to a set of single-cell transcriptome data showed that it benefits the recognition of cell populations and the explanation of biological function, even with incomplete cell type information or multiple data sources. In addition, cmCluster produced clear cluster boundaries and appropriate subtypes with potential marker genes. The relevant code is available on GitHub (huangyuwei301/cmCluster).
Conclusions: We speculate that cmCluster provides researchers with effective screening strategies that improve the accuracy of subsequent biological analyses, reduce artificial bias, and facilitate the comparison and analysis of multiple studies.
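The search strategy behind cmCluster, as we read the abstract, can be sketched as a genetic algorithm over a clustering parameter scored by annotation agreement. The sketch below is a minimal stand-in: the fitness function is a hypothetical stand-in for the marker-based annotation score, and the single resolution parameter stands in for the real Louvain clustering configuration.

```python
# Minimal GA sketch of the cmCluster search loop (all specifics are
# stand-ins: marker_score is a toy fitness peaking at resolution 1.2,
# not the paper's annotation-based score).
import random

random.seed(0)

def marker_score(resolution):
    """Stand-in fitness: agreement with known markers, peaking near 1.2."""
    return -(resolution - 1.2) ** 2

def ga_search(lo=0.1, hi=3.0, pop=20, gens=30, mut=0.1):
    """Tiny genetic algorithm over a single clustering-resolution parameter."""
    population = [random.uniform(lo, hi) for _ in range(pop)]
    for _ in range(gens):
        population.sort(key=marker_score, reverse=True)
        parents = population[:pop // 2]           # selection (elitist top half)
        children = []
        for _ in range(pop - len(parents)):       # crossover + mutation
            a, b = random.sample(parents, 2)
            child = (a + b) / 2 + random.gauss(0, mut)
            children.append(min(max(child, lo), hi))
        population = parents + children
    return max(population, key=marker_score)

best = ga_search()
print(round(best, 2))  # converges near the toy optimum of 1.2
```

In the real method the fitness would run Louvain clustering at the candidate setting, annotate the clusters with cell markers, and score the result; a grid search over the same parameter is the non-evolutionary alternative mentioned in the abstract.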
Background: Machine learning has enabled the automatic detection of facial expressions, which is particularly beneficial in smart monitoring and understanding the mental state of medical and psychological patients. Most algorithms that attain high emotion classification accuracy require extensive computational resources, which either require bulky and inefficient devices or require the sensor data to be processed on cloud servers. However, there is always the risk of privacy invasion, data misuse, and data manipulation when raw images are transferred to cloud servers for processing facial emotion recognition (FER) data. One possible solution to this problem is to minimize the movement of such private data.
Methods: In this research, we propose an efficient implementation of a convolutional neural network (CNN) based algorithm for on-device FER on a low-power field programmable gate array (FPGA) platform. This is done by encoding the CNN weights to approximated signed digits, which reduces the number of partial sums to be computed for multiply-accumulate (MAC) operations. This is advantageous for portable devices that lack full-fledged resource-intensive multipliers.
Results: We applied our approximation method to MobileNet-v2 and ResNet18 models, which were pretrained with the FER2013 dataset. Our implementations and simulations reduce the FPGA resource requirement by at least 22% compared to models with integer weights, with negligible loss in classification accuracy.
Conclusions: The outcome of this research will help in the development of secure and low-power systems for FER and other biomedical applications. The approximation methods used in this research can also be extended to other image-based biomedical research fields.
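The weight-encoding step described above can be illustrated with the canonical signed-digit (CSD) representation, the standard signed-digit encoding in which no two adjacent digits are nonzero; whether the paper uses exactly CSD rather than another signed-digit approximation is our assumption. Fewer nonzero digits means fewer partial sums in a shift-add MAC unit.

```python
# Canonical signed-digit (CSD) encoding sketch. Assumption: CSD is used
# as the concrete signed-digit scheme; the paper says only "approximated
# signed digits". Digits are -1/0/+1, least significant first.
def to_csd(n):
    """Encode integer n so that no two adjacent digits are nonzero."""
    digits = []
    while n != 0:
        if n % 2:
            d = 2 - (n % 4)   # +1 if n ≡ 1 (mod 4), -1 if n ≡ 3 (mod 4)
            n -= d            # (n - d) is divisible by 4, so next digit is 0
        else:
            d = 0
        digits.append(d)
        n //= 2
    return digits or [0]

def from_csd(digits):
    """Decode a CSD digit list back to the integer it represents."""
    return sum(d << i for i, d in enumerate(digits))

w = 119                       # a quantised 8-bit weight (hypothetical value)
csd = to_csd(w)
print(bin(w).count("1"), sum(1 for d in csd if d))  # → 6 3
```

Here the weight 119 needs six additions in plain binary but only three shift-add/subtract terms in CSD, which is the resource saving the FPGA implementation exploits; the encoding also handles negative weights directly.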
Background: Morphogenesis is a complex process in a developing animal at the organ, cellular and molecular levels. In this investigation, allometry at the cellular level was evaluated.
Methods: Geometric information, including the time-lapse Cartesian coordinates of each cell's center, was used to calculate the allometric coefficients. A zero-centroaxial skew-symmetrical matrix (CSSM) was generated and used to construct another square matrix (the basic square matrix, BSM), and the determinant of the BSM (d) was calculated. The logarithm of the absolute value of d (Lad) was plotted for all cells across a range of developmental stages, and the slope of the regression line was used as the allometric coefficient. Moreover, the lineage growth rate (LGR) was calculated by plotting Lad against the logarithm of time, and the complexity index at each stage was computed. The method was tested on a developing Caenorhabditis elegans embryo.
Results: We explored two of the four first-generation blastomeres in the C. elegans embryo, the ABp and EMS lineages. The allometric coefficient of ABp was higher than that of EMS, which was consistent with both the complexity index and the LGR.
Conclusion: The complexity of differentiating cells in a developing embryo can be evaluated by allometric scaling based on the Cartesian coordinates of the cells at different stages of development.
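The downstream arithmetic of this pipeline, taking a determinant per stage, forming Lad = log|d|, and regressing to obtain a slope, can be sketched as below. The matrix construction is a stand-in (a signed pairwise-distance skew-symmetric matrix), not the paper's actual CSSM/BSM construction, which the abstract does not fully specify.

```python
# Sketch of the Lad-regression step. Assumptions: skew_from_points is a
# hypothetical stand-in for the CSSM/BSM construction; the scaling of a
# toy 4-cell configuration stands in for developmental stages.
import math

def skew_from_points(points):
    """Stand-in skew-symmetric matrix: signed pairwise distances."""
    n = len(points)
    return [[(math.dist(points[i], points[j]) if i < j
              else -math.dist(points[i], points[j]) if i > j else 0.0)
             for j in range(n)] for i in range(n)]

def det(m):
    """Determinant via Gaussian elimination with partial pivoting."""
    a = [row[:] for row in m]
    n, d = len(a), 1.0
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(a[r][c]))
        if abs(a[p][c]) < 1e-12:
            return 0.0
        if p != c:
            a[c], a[p] = a[p], a[c]
            d = -d
        d *= a[c][c]
        for r in range(c + 1, n):
            f = a[r][c] / a[c][c]
            for j in range(c, n):
                a[r][j] -= f * a[c][j]
    return d

def slope(xs, ys):
    """Least-squares slope of ys on xs (the allometric coefficient)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    return (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
            / sum((x - mx) ** 2 for x in xs))

# Toy "stages": the same 4-cell configuration scaled by a growth factor t.
base = [(0, 0, 0), (1, 0, 0), (0, 1, 0), (0, 0, 1)]
lads, logt = [], []
for t in (1.0, 2.0, 3.0):
    pts = [(t * x, t * y, t * z) for x, y, z in base]
    lads.append(math.log(abs(det(skew_from_points(pts)))))
    logt.append(math.log(t))

print(round(slope(logt, lads), 3))  # → 4.0
```

The slope comes out as exactly 4 here because uniformly scaling the coordinates by t scales every matrix entry by t, so the 4×4 determinant scales as t⁴; in the real analysis the slope of Lad against log time gives the LGR, and against stage it gives the allometric coefficient.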