Advances in deep learning have significantly aided protein engineering in addressing challenges in industrial production, healthcare, and environmental sustainability. This review frames frequently researched problems in protein understanding and engineering from the perspective of deep learning. It provides a thorough discussion of representation methods for protein sequences and structures, along with general encoding pipelines that support both pre-training and supervised learning tasks. We summarize state-of-the-art protein language models, geometric deep learning techniques, and the combination of distinct approaches to learning from multi-modal biological data. Additionally, we outline common downstream tasks and relevant benchmark datasets for training and evaluating deep learning models, focusing on the particular needs of protein engineering applications, such as identifying mutation sites and predicting properties for the virtual screening of candidates. This review offers biologists the latest tools for assisting their engineering projects while providing a clear and comprehensive guide for computer scientists to develop more powerful solutions by standardizing problem formulation and consolidating data resources. We anticipate that future research will bring a deeper integration of the biology and computer science communities, unleashing the full potential of deep learning in protein engineering and driving new scientific breakthroughs.
Optimizing enzyme thermostability is essential for advancements in protein science and industrial applications. Currently, (semi-)rational design and random mutagenesis methods can accurately identify single-point mutations that enhance enzyme thermostability. However, complex epistatic interactions often arise when multiple mutation sites are combined, leading to the complete inactivation of combinatorial mutants. As a result, constructing an optimized enzyme often requires repeated rounds of design to incrementally incorporate single mutation sites, which is highly time-consuming. In this study, we developed an AI-aided strategy for enzyme thermostability engineering that efficiently facilitates the recombination of beneficial single-point mutations. We utilized thermostability data from creatinase, including 18 single-point mutants, 22 double-point mutants, 21 triple-point mutants, and 12 quadruple-point mutants. With these data as inputs, we used a temperature-guided protein language model, Pro-PRIME, to learn epistatic features and design combinatorial mutants. After two rounds of design, we obtained 50 combinatorial mutants with superior thermostability, achieving a success rate of 100%. The best mutant, 13M4, contained 13 mutation sites and maintained nearly full catalytic activity relative to the wild type. It showed a 10.19°C increase in the melting temperature and an ∼655-fold increase in the half-life at 58°C. Additionally, the model successfully captured epistasis in high-order combinatorial mutants, including sign epistasis (K351E) and synergistic epistasis (D17V/I149V). We elucidated the mechanism of long-range epistasis in detail using a dynamics cross-correlation matrix method. Our work provides an efficient framework for engineering enzyme thermostability and studying high-order epistatic effects in the directed evolution of proteins.
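The notions of sign and synergistic epistasis mentioned above can be illustrated with a small calculation. The following is a minimal sketch, not the method used in the study: it assumes each mutant's effect is expressed as a change in melting temperature relative to the wild type (positive means stabilizing), and the function names, the tolerance `tol`, and all numbers in the usage note are hypothetical.

```python
def epistasis(effect_a, effect_b, effect_ab):
    """Deviation of the double mutant from the additive expectation."""
    return effect_ab - (effect_a + effect_b)


def classify_pair(effect_a, effect_b, effect_ab, tol=0.5):
    """Label a pair of mutations by the type of epistasis they show.

    All effects are changes relative to the wild type (assumed here to be
    delta-Tm in degrees C). `tol` is a hypothetical noise threshold below
    which the pair is treated as additive.
    """
    eps = epistasis(effect_a, effect_b, effect_ab)
    if abs(eps) <= tol:
        return "additive"
    # Effect of each mutation when the other one is already present.
    effect_a_in_b = effect_ab - effect_b
    effect_b_in_a = effect_ab - effect_a
    # Sign epistasis: a mutation flips from harmful to beneficial
    # (or the reverse) depending on the background.
    if effect_a * effect_a_in_b < 0 or effect_b * effect_b_in_a < 0:
        return "sign"
    # Otherwise only the magnitude changes.
    return "synergistic" if eps > 0 else "antagonistic"
```

For example, with made-up values, `classify_pair(1.0, 2.0, 5.0)` gives `"synergistic"` because the double mutant exceeds the additive expectation of 3.0, while `classify_pair(-1.0, 2.0, 4.0)` gives `"sign"` because the first mutation is destabilizing alone but stabilizing in the background of the second.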
In silico computational methods, including molecular docking, molecular dynamics, quantum mechanics, and multiscale QM/MM approaches, have been widely used to study enzyme catalytic mechanisms and to engineer enzyme performance. However, the manual operation associated with these methods poses challenges for simulating enzymes and enzyme variants in a high-throughput manner. We developed NAC4ED, a high-throughput enzyme mutagenesis computational platform based on the “near-attack conformation” design strategy for enzyme catalysis substrates. This platform circumvents the complex calculations involved in transition-state searching by representing enzyme catalytic mechanisms with parameters derived from near-attack conformations. NAC4ED enables the automated, high-throughput, and systematic computation of enzyme mutants, including protein model construction, complex structure acquisition, molecular dynamics simulation, and analysis of active conformation populations. In validation, NAC4ED achieved a prediction accuracy of 92.5% for 40 mutations, showing strong consistency between the computational predictions and experimental results. The time required for automated determination of a single enzyme mutant using NAC4ED is 1/764th of that needed for experimental methods. This significantly enhances the efficiency of predicting the effects of enzyme mutations and improves the performance of high-throughput screening of enzyme variants. NAC4ED facilitates the efficient generation of a large amount of annotated data, providing high-quality data for statistical modeling and machine learning. NAC4ED is currently available at http://lujialab.org.cn/software/.
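The "analysis of active conformation populations" step can be pictured as counting the fraction of simulation frames that satisfy near-attack geometric criteria. The sketch below is only an illustration under stated assumptions and is not NAC4ED's implementation: it assumes each frame has already been reduced to one attack distance (in Å) and one attack angle (in degrees), and the function name and the default thresholds (`max_distance`, `angle_target`, `angle_tol`) are hypothetical placeholders that would depend on the reaction.

```python
def nac_population(frames, max_distance=3.5, angle_target=180.0, angle_tol=20.0):
    """Fraction of frames that count as near-attack conformations.

    `frames` is a sequence of (distance, angle) pairs, one per MD frame.
    A frame counts when the distance is at most `max_distance` and the
    angle lies within `angle_tol` of `angle_target`. All thresholds are
    illustrative defaults, not values taken from NAC4ED.
    """
    if not frames:
        raise ValueError("at least one frame is required")
    hits = sum(
        1
        for distance, angle in frames
        if distance <= max_distance and abs(angle - angle_target) <= angle_tol
    )
    return hits / len(frames)
```

Comparing this fraction between a variant and the wild type gives a simple proxy for how often the substrate adopts a reactive geometry, which is the kind of quantity a near-attack-conformation strategy uses in place of an explicit transition-state search.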
Halophilic proteins possess unique structural properties and show high stability under extreme conditions. These characteristics make them invaluable for applications in fields such as bioenergy, pharmaceuticals, environmental clean-up, and energy production. Generally, halophilic proteins are discovered and characterized through labor-intensive and time-consuming wet lab experiments. In this study, we introduce the Halophilic Protein Classifier (HPClas), a machine learning-based classifier developed using the CatBoost ensemble learning technique to identify halophilic proteins. Extensive in silico calculations were conducted on a large public dataset of 12,574 samples, and HPClas achieved an area under the receiver operating characteristic curve (AUROC) of 0.844 on an independent test set of 200 samples. The source code and curated dataset of HPClas are publicly available at https://github.com/Showmake2/HPClas. In conclusion, HPClas can be explored as a promising tool to aid in the identification of halophilic proteins and accelerate their application in different fields.
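For readers unfamiliar with the AUROC metric reported above, it equals the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one, with ties counted as one half. The following is a minimal sketch of that definition in plain Python; it is not code from the HPClas repository, and the function name and the example labels and scores are hypothetical.

```python
def auroc(labels, scores):
    """Area under the ROC curve via pairwise comparison.

    `labels` holds 1 for positives (for example, halophilic) and 0 for
    negatives; `scores` holds the classifier's predicted scores.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative sample")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(positives) * len(negatives))
```

With the toy inputs `labels = [1, 1, 0, 0]` and `scores = [0.9, 0.4, 0.6, 0.1]`, three of the four positive-negative pairs are ordered correctly, so the function returns 0.75. The pairwise loop is quadratic in the number of samples, which is acceptable for a test set of a few hundred samples.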
Glycosylphosphatidylinositol (GPI) anchoring is a conserved posttranslational modification in eukaryotes that attaches proteins to the plasma membrane. In fungi, in addition to plasma membrane GPI-anchored proteins (GPI-APs), some GPI-APs are specifically released from the cell membrane, secreted into the cell wall, and covalently linked to cell wall glucans as GPI-anchored cell wall proteins (GPI-CWPs). However, it remains unclear how fungal cells specifically release GPI-CWPs from their membranes. In this study, phospholipase PlcH was identified and confirmed as a phospholipase C that hydrolyzes phosphate ester bonds to release GPI-APs from the membrane of the opportunistic fungal pathogen Aspergillus fumigatus. Deletion of the plcH gene led to abnormal conidiation, abnormal polarity, and increased sensitivity to antifungal drugs. In an immunocompromised mouse model, the ΔplcH mutant showed an attenuated inflammatory response and increased macrophage killing compared with the wild type. Biochemical and proteomic analyses revealed that PlcH was involved in the localization of various cell wall GPI-APs and contributed to cell wall integrity. Our results demonstrate that PlcH can specifically recognize and release GPI-CWPs from the cell membrane, which represents a newly discovered secretory pathway of GPI-CWPs in A. fumigatus.
Compelling concerns about antimicrobial resistance and the emergence of multidrug-resistant pathogens call for novel strategies to address these challenges. Nanoparticles show promising antimicrobial activities; however, their actions are hindered primarily by the bacterial hydrophilic–hydrophobic barrier. To overcome this, we developed a method of electrochemically anchoring sodium dodecyl sulfate (SDS) coatings onto silver nanoparticles (AgNPs), resulting in improved antimicrobial potency. We then investigated the antimicrobial mechanisms and developed therapeutic applications. We demonstrated that the SDS-coated AgNPs have anomalous dispersive properties, dispersing in both polar and nonpolar solvents, and detected significantly higher bacteriostatic and bactericidal effects compared to silver ions (Ag+). Cellular assays suggested multipotent disruptions targeting the bacterial membrane, evidenced by increased leakage of lactate dehydrogenase, protein, and sugar, and consistent with results from the transcriptomic analysis. Notably, the amphiphilic characteristics of the AgNPs maintained robust antibacterial activities for a year at various temperatures, indicating long-term efficacy as a potential disinfectant. In a murine model, the AgNPs showed considerable biocompatibility and could alleviate fatal Salmonella infections. Collectively, by conferring amphiphilic properties through SDS, we offer novel AgNPs as a long-term and cost-effective strategy against bacterial infections.
Optical density (OD) is an important indicator of microbial density and a commonly used variable in growth curves to express the growth of microbial cultures. However, OD values show a linear relationship with bacterial concentration only at low concentrations. When the cell density is high, the relationship loses linearity, and serial dilution is needed to obtain more accurate readings. Here, we show that measuring OD values using shorter light paths is closely equivalent to measuring OD values of the correspondingly diluted cell culture. By measuring three different light paths simultaneously, accurate OD values can be easily obtained from low to high cell density. Using this method, growth curves of Escherichia coli, Staphylococcus aureus, and Pichia pastoris are measured with higher accuracy. To further simplify the process, an L-shaped cuvette and a corresponding turbidimeter are designed specifically for OD value measurement based on the multi-light path transmission method.
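The equivalence between a shorter light path and dilution can be sketched as a simple selection-and-scaling rule. The code below is an illustrative sketch only, not the algorithm of the turbidimeter described above: it assumes OD scales proportionally with path length within the linear range, and the function name, the 10 mm reference path, the `linear_limit` of 0.5, and the path lengths in the usage note are all hypothetical.

```python
def path_corrected_od(readings, reference_path_mm=10.0, linear_limit=0.5):
    """Estimate the OD at the reference path from multi-path readings.

    `readings` maps light path length in mm to the raw OD measured on that
    path. The longest path whose raw reading is still within the assumed
    linear range is used and scaled to the reference path, which mimics
    diluting the culture by the ratio of the two path lengths.
    """
    if not readings:
        raise ValueError("at least one reading is required")
    # Prefer longer paths: they give the best signal at low cell density.
    for path in sorted(readings, reverse=True):
        raw = readings[path]
        if raw <= linear_limit:
            return raw * reference_path_mm / path
    # Every path is saturated: fall back to the shortest one.
    shortest = min(readings)
    return readings[shortest] * reference_path_mm / shortest
```

For a dense culture with hypothetical readings `{10.0: 1.6, 5.0: 0.9, 2.0: 0.4}`, only the 2 mm path is within the assumed linear range, so the estimate is 0.4 × 10 / 2 = 2.0, analogous to a fivefold dilution read on a 10 mm path. For a dilute culture the 10 mm reading is used directly.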