Machine learning meets enzyme engineering: examples in the design of polyethylene terephthalate hydrolases

Rohan Ali, Yifei Zhang

Front. Chem. Sci. Eng., 2024, Vol. 18, Issue 12: 149. DOI: 10.1007/s11705-024-2500-7

Abstract

The use of machine learning methods to develop promising biocatalysts is on the rise. Leveraging experimental findings and simulation data, these methods facilitate enzyme engineering and even the design of new-to-nature enzymes. This review focuses on the application of machine learning methods in the engineering of polyethylene terephthalate (PET) hydrolases, enzymes that have the potential to help address plastic pollution. We provide an overview of machine learning workflows and of useful methods and tools for protein design and engineering, and discuss the recent progress of machine learning-aided PET hydrolase engineering and de novo design of PET hydrolases. Finally, as machine learning in enzyme engineering is still evolving, we foresee that advancements in computational power and in the quality of data resources will considerably increase the use of data-driven approaches in this field in the coming decades.

Keywords

machine learning / artificial intelligence / enzyme engineering / polyethylene terephthalate hydrolase / enzyme design

Cite this article

Rohan Ali, Yifei Zhang. Machine learning meets enzyme engineering: examples in the design of polyethylene terephthalate hydrolases. Front. Chem. Sci. Eng., 2024, 18(12): 149 https://doi.org/10.1007/s11705-024-2500-7

1 Introduction

Human society benefits significantly from synthetic polymers, and plastics have become indispensable materials in nearly all aspects of daily life. With over 82 million metric tons manufactured worldwide, polyethylene terephthalate (PET) is one of the most commonly used synthetic plastics [1]. PET is easy to manufacture and use owing to its thinness, lightness, durability, and moldability. Unfortunately, the attributes that enticed consumers to use plastic are also responsible for the ecological damage that occurs when it is discarded. PET-based materials have been poorly managed, resulting in heavy dumping and hazardous environmental buildup; their persistence is a particular concern because PET is hardly biodegraded, a resistance that stems from properties such as its highly rigid polymer backbone and high degree of crystallinity [2]. Authorities and legislators, plastic manufacturers and dependent sectors, and consumers are becoming more aware of these challenges, which has escalated research and development of plastic replacement and valorization approaches for a rapid transition from a linear to a circular, sustainable plastic economy [3].
The mechanical recycling of plastic waste necessitates a presorting step, while incineration releases hazardous toxins [4]. Chemical treatments, in turn, require potent acids/bases and costly catalysts, which often leads to the release of more contaminants into the environment [4]. A workable and sustainable alternative for PET recycling is depolymerization by enzymes under mild conditions [5–9]. In recent years, various enzymes have been found to hydrolyze PET, albeit with lower efficiency on highly crystalline, non-pretreated post-consumer PET waste [10–12]. For example, the hydrolase TfH from the actinomycete Thermobifida fusca DSM43793, which can break down both amorphous and crystalline sections of PET films at 55 °C, was discovered in 2005 and is thought to be the first enzyme known to degrade PET [10]. IsPETase, an enzyme with reasonably high intrinsic PET-degrading activity produced by the bacterium Ideonella sakaiensis, was isolated from the yard of a PET bottle-recycling factory in Sakai City, Japan [12]. Extensive biocatalyst engineering is being explored to enhance the performance of these enzymes [13–18]. Using conventional protein engineering strategies, including rational design, semi-rational design and directed evolution, researchers have obtained various mutants with improved activity and stability. Using IsPETase as a template, various studies have engineered variants with enhanced activities [17,19–22]. In this regard, rational design was employed to enhance the thermal stability of IsPETase, resulting in a triple mutant (IsPETaseTS) that exhibited a 14-fold increase in activity and a 9 °C higher melting temperature (Tm) in comparison to the wild type (WT) [21]. DuraPETase, a variant with 10 substitutions derived computationally by the GRAPE strategy (greedy accumulated strategy for protein engineering), exhibits a Tm elevated by 31 °C and enhanced degrading activity compared with IsPETase [13]. Directed evolution has also contributed to the advancement of IsPETase engineering: a mutant, HotPETase, was identified by screening more than 13000 variants, with its Tm increased by 36 °C and its activity at 65 °C approximately 43 times higher than that of IsPETaseTS [19]. One of the most efficient mutants of leaf-branch compost cutinase (LCC), LCCICCG, was generated via site saturation mutagenesis by Tournier et al. [23]. It has optimal activity at 72 °C and can depolymerize approximately 90% of PET into monomers. Currently, Carbios, a French biotech company, is commercializing it [24].
Scientists have recently begun employing data-driven methods to explore protein sequence–function relationships. This data-driven paradigm shift enables the swift and efficient engineering of proteins with modest computation and only a handful of rounds of evolution [25]. In comparison, the directed evolution strategy involves months of lengthy experimentation, and the rational design approach frequently necessitates an in-depth understanding of the reaction mechanism, which can only be obtained through rigorous analysis and high-accuracy simulations [26]. This progressive shift has begun across multiple research disciplines, moving from almost exclusive reliance on experimental work to hybrid techniques that combine data-driven methodologies and computational models [27–29]. Previously, researchers would compile findings from individual studies and use the collected data to develop fundamental assumptions; to gain a deeper understanding of the system, they would then build simulations adhering to these assumptions. Advances in computing capacity have enabled scientists to shift toward data-driven approaches that deploy machine learning algorithms to deduce patterns directly from data [30,31]. A key advantage of machine learning in enzyme engineering is its generalizability: once trained on defined inputs, a model can readily predict the properties of novel variants. With continued advancements in computational technology and data-collection strategies, the trend toward data-driven approaches will certainly persist, even as computational models and experimental studies continue to play crucial roles [27].
This review highlights recently published investigations of machine learning-assisted engineering of PET hydrolases for improved stability and catalytic performance. The subsequent sections commence with an overview of machine learning fundamentals and a brief introduction to some key algorithms and models frequently employed in data-driven enzyme engineering. Next, we discuss recent achievements in the data-driven engineering of hydrolases to accelerate the degradation of PET. Lastly, we point out the challenges and future potential of data-driven protein design and engineering. We hope this discussion will encourage more scientists to create and employ artificial intelligence tools in designing more active and robust industrial biocatalysts.

2 Machine learning essentials: a quick dive into the basics

2.1 Machine learning basics

Humans understand their surroundings through observation and analysis. Consider how a child learns to catch a ball: the child knows nothing about the physical laws governing motion, yet through repeated trial and error eventually masters the catch. In effect, the child acquires this skill by building an internal model, continuously testing it against data and refining it toward near perfection [31]. The same applies to enzyme engineering: we know that enzymatic features are encoded in amino acid sequences and in the residue arrangements of catalytic sites. However, the sequence space is vast, and nature has explored only a fraction of it. Efficient exploration of this untouched sequence space could lead to the generation of novel enzymes. Consequently, machine learning is well suited to be the preferred tool for such exploratory tasks [32].
Machine learning, a subfield of artificial intelligence, is an array of computational tools driven by various models that employ algorithms to understand, explain, evaluate and draw trends from data. Machine learning algorithms can interpret patterns, draw correlations among protein structures, sequences, active-site residues, or other relevant data, and make predictions toward specific objectives [33]. Machine learning is divided into two major types: unsupervised learning and supervised learning. Unsupervised learning aims to compress high-dimensional data (e.g., by clustering) or deduce data trends without using labeled sets. In supervised learning, by contrast, algorithms learn from labeled examples: the training set consists of pairs of raw inputs and desired outputs. The ultimate objective is to create a predictor that generalizes to unseen data based on the labeled training data set [34]. Supervised and unsupervised learning are often used in combination, which is called semi-supervised learning [35]. When labeled data are insufficient, semi-supervised learning builds a supervised model while also utilizing unlabeled data to capture the broad distribution of the data. Another approach, self-supervised learning, removes the need for labels by masking a section of the input data and training a model to predict the masked section [36].
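
To make the supervised/unsupervised distinction concrete, the short Python sketch below (our own illustration, not code from any study cited here) one-hot encodes a handful of invented peptide sequences, trains a supervised classifier on labeled data, and clusters the same data without labels; all sequences, labels and parameter choices are hypothetical and scikit-learn is assumed.

```python
# Minimal sketch: supervised vs. unsupervised learning on toy peptide data.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.cluster import KMeans

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def one_hot(seq: str) -> np.ndarray:
    """Flatten a peptide into a binary (length x 20) feature vector."""
    vec = np.zeros((len(seq), len(AMINO_ACIDS)))
    for i, aa in enumerate(seq):
        vec[i, AMINO_ACIDS.index(aa)] = 1.0
    return vec.ravel()

# Hypothetical 6-residue peptides with made-up "active"/"inactive" labels.
seqs = ["ACDEFG", "ACDEFH", "WYWYWY", "WYWYWV", "ACDEYG", "WYWVWY"]
labels = [1, 1, 0, 0, 1, 0]
X = np.array([one_hot(s) for s in seqs])

# Supervised: learn a mapping from sequence features to the provided labels.
clf = LogisticRegression(max_iter=1000).fit(X, labels)
print("supervised prediction:", clf.predict(X[:1]))

# Unsupervised: group the same data without ever seeing the labels.
clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print("unsupervised clusters:", clusters)
```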

2.2 Machine learning workflow

Supervised learning is an effective approach in enzyme engineering because it centers on enhancing one or more of an enzyme's characteristics. In general, there are three steps in the machine learning workflow (Fig.1). Step 1, the most arduous phase, involves fetching, organizing, storing and pre-formatting the data before they are fed to the algorithm. Some of the preferred enzyme databases for acquiring such information are BRENDA [37], PDB (Protein Data Bank) [38], IntEnzyDB [39], UniProtKB [40,41], EnzymeML [42], SoluProtMutDB [43], ThermoMutDB [44], FireProtDB [45], ProThermDB [46], EnzymeMap [47], ECREACT [48], and others [49–52]. Step 2 involves training the chosen model on the processed data. Step 3, model validation with test data, is the last step. Between steps 1 and 2, the raw data are divided into two distinct groups: a portion is employed as the training subset for fitting the predictor's parameters (step 2), and the remaining data are held out for the final validation in step 3 [34]. The confusion matrix—the tally of true/false positives and negatives—is often employed to evaluate classification tasks with binary labels or labels from a finite set. When the labels take continuous values, as in regression tasks, computing the root-mean-squared error is recommended. In either case, the final validation is executed on the test data set, as the main objective is the predictor's generalizability to information not used for training [26].
Fig.1 Workflow diagram of machine learning model training. From left to right: step 1 involves fetching and sorting data; step 2 is training the algorithm with the sorted/labeled training data and building the model; and step 3 is validating the model with test data and fine-tuning it.
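
A minimal Python sketch of these three steps on synthetic data is given below (scikit-learn assumed; not the pipeline of any study reviewed here). The feature matrix and labels are random placeholders, with a classification branch evaluated by a confusion matrix and a regression branch evaluated by the root-mean-squared error.

```python
# Minimal sketch of the three-step workflow on synthetic data.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.metrics import confusion_matrix, mean_squared_error

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 16))                       # step 1: (synthetic) feature matrix
y_class = (X[:, 0] > 0).astype(int)                  # binary label, e.g., "active"/"inactive"
y_reg = X[:, 0] * 2.0 + rng.normal(0.1, size=200)    # continuous label, e.g., a stability change

# Hold out a test set for the final validation (step 3).
X_tr, X_te, yc_tr, yc_te, yr_tr, yr_te = train_test_split(
    X, y_class, y_reg, test_size=0.25, random_state=0)

# Step 2: fit models on the training portion only.
clf = RandomForestClassifier(random_state=0).fit(X_tr, yc_tr)
reg = RandomForestRegressor(random_state=0).fit(X_tr, yr_tr)

# Step 3: evaluate on the held-out data.
print(confusion_matrix(yc_te, clf.predict(X_te)))              # classification task
print(np.sqrt(mean_squared_error(yr_te, reg.predict(X_te))))   # RMSE for regression task
```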

It is also possible for users to select a predictor or fine-tune one during training step 2 by performing K-fold cross-validation. In this case, the training data are further divided into K subsets and training is repeated K times, with each of the K subsets held out once for assessment and the remaining K–1 subsets used for training. The average performance across folds then guides fine-tuning [34]. Managing underfitting and overfitting is a significant challenge during step 2 of any supervised learning training. Underfitting emerges when a predictor cannot identify trends in the training data, e.g., when a basic linear model is used to fit nonlinear data dependencies. Overfitting occurs when a predictor learns excessive detail and noise and fails to recognize broad trends, causing its performance to decline drastically on the test data set compared to the training set. Underfitting and overfitting can occur as a consequence of poor data quality (including high noise, irrelevant/missing attributes, and data biases and/or sparseness) and ineffective algorithm application (such as excess or inadequate parameter adaptability, lack of proper training, or contaminating training data with test data) [26].
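
The hedged sketch below illustrates K-fold cross-validation on synthetic data (scikit-learn assumed, all values placeholders): each fold is held out once, the average score guides model selection, and a large gap between training and validation scores is a simple symptom of overfitting.

```python
# Minimal K-fold cross-validation sketch with a train/validation score comparison.
import numpy as np
from sklearn.model_selection import KFold, cross_validate
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(1)
X = rng.normal(size=(150, 10))
y = X[:, 0] ** 2 + rng.normal(scale=0.2, size=150)   # nonlinear toy target

cv = KFold(n_splits=5, shuffle=True, random_state=1)
scores = cross_validate(RandomForestRegressor(random_state=1), X, y,
                        cv=cv, return_train_score=True)
print("mean train R2:", scores["train_score"].mean())
print("mean validation R2:", scores["test_score"].mean())  # much lower => overfitting
```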

2.3 Some key machine learning algorithms and models

Employing an appropriate machine learning algorithm is key to achieving high prediction accuracy in pattern recognition. Several machine learning algorithms have been successfully implemented to augment biocatalyst engineering, such as support vector machines (SVM) [53,54], multivariate analysis [55], artificial neural networks [56,57], Gaussian processes [58,59], ensemble learning [60,61] and reinforcement learning [62]. Earlier studies mostly relied on protein sequences as the sole input; to enhance predictive power, models developed in recent years tend to incorporate more catalytically relevant data, such as enzyme and substrate structures. An organized list of some machine learning methods useful in enzyme engineering is provided in Tab.1; a brief illustrative sketch of one such algorithm follows the table.
Tab.1 A short list of machine learning tools that are helpful in enzyme engineering
Objective | Machine learning tool | Machine learning algorithms | Input data | Availability | Ref.
Enzyme classification | DeepEC | Convolutional neural network | Protein sequences | Downloadable | [63]
 | ECPred | Ensemble (SVM, k-nearest neighbors) | Protein sequences | Webserver | [64]
 | mlDEEPre | Ensemble (convolutional and recurrent neural networks) | Protein sequences | Webserver | [65]
Substrate identification | innov'SAR | Partial least squares regression | Protein structures and sequences | Webserver | [66]
 | pNPred | Random forest | Protein structures and sequences | Webserver | [67]
 | AdenylPred | Random forest | Protein sequences | N/A | [68]
Enzyme catalytic site prediction | PREvaIL | Random forest | Protein structures and sequences | Downloadable | [69]
 | 3DCNN | Convolutional neural network | Protein structures | N/A | [70]
 | MAHOMES | Random forest | Protein structures | Downloadable | [32]
 | POOL | POOL | Protein structures and sequences | Webserver | [71]
Optimum condition prediction | TOME | Random forest | Protein sequences | Downloadable | [72]
 | TAXyl | Random forest | Protein sequences | Downloadable | [73]
Enzyme activity prediction | DLKcat | Graph-based and convolutional neural networks | Substrates as SMILES and enzyme sequences | Downloadable | [74]
 | MaxEnt | Statistical Potts model | Single and pairwise amino acid frequencies from MSA | Downloadable | [75]
 | MutCompute | Self-supervised convolutional neural network | Protein structures | Webserver | [76]
 | innov'SAR | Partial least squares regression | Protein sequences | N/A | [77]
 | Høie et al. model | Random forest | Custom preprocessing derived from the PRISM approach | Downloadable | [78]
 | SolventNet | Convolutional neural network | Protein structures | N/A | [79]
 | EnzyKR | Classifier-regressor architecture | Substrate-hydrolase complexes | Downloadable | [80]
Stability prediction (ΔΔG) | BayeStab | Bayesian neural networks | Protein structures | Webserver | [81]
 | PROST | Ensemble model | Protein sequences | Downloadable | [82]
 | KORPM | Nonlinear regression | Protein sequences | Downloadable | [83]
 | ABYSSAL | Siamese deep neural networks | Protein sequences | Downloadable | [84]
 | TOMER | Bagging with resampling | Protein sequences | Downloadable | [85]
Protein solubility | PON-Sol2 | LightGBM | Protein sequences | Webserver | [86]
Protein design | bmDCA | Direct coupling analysis | Protein sequences | Downloadable | [87]
 | bmDCA | Linear regression | Protein structures | Downloadable | [88]
 | ProteinMPNN | Message-passing neural network | Protein structures | Downloadable | [89]
 | RFdiffusion | Denoising diffusion probabilistic model | Protein structures | Downloadable | [90]
 | FoldingDiff | Denoising diffusion probabilistic model | Backbones from the CATH data set | Downloadable | [91]
 | GearNet | Graph neural network | Protein structures | Downloadable | [92]
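
As a concrete instance of one algorithm family listed above (Gaussian processes), the hedged sketch below fits a Gaussian process regressor to synthetic data (scikit-learn assumed, all values placeholders); the per-prediction uncertainty it returns is what makes this family attractive for proposing which variants to test next.

```python
# Minimal Gaussian process regression sketch with uncertainty estimates.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

rng = np.random.default_rng(2)
X = rng.uniform(-3, 3, size=(40, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=40)  # toy property, e.g., activity

gp = GaussianProcessRegressor(kernel=RBF() + WhiteKernel(), normalize_y=True)
gp.fit(X, y)

X_new = np.linspace(-3, 3, 5).reshape(-1, 1)
mean, std = gp.predict(X_new, return_std=True)         # predictions plus uncertainty
print(np.round(mean, 2), np.round(std, 2))
```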

3 Recent progress: machine learning-driven engineering of PET hydrolase

3.1 Leveraging machine learning for enhancing enzyme activity

Besides rational design, semi-rational design and directed evolution, machine learning approaches are now frequently employed to identify patterns in data that aid in predicting enzyme structure, enhancing activity, stability and solubility, optimizing enzyme reaction kinetics, and guiding rational design (Fig.2). Lu et al. [16] used MutCompute, a three-dimensional (3D) convolutional neural network model trained on over 19000 protein sequences and structures from the PDB database. MutCompute can predict candidate residues for mutagenesis according to their favorability in their native microenvironments [76]. It consists of nine layers split into two sections: feature extraction and classification (Fig.2(a)). Six layers form the feature extraction section: two sets of 3D convolutional layers, each ending with a dimension-reducing max pooling layer. The outputs of the four convolutional layers were passed through the rectified linear unit (ReLU) function. The classification section had three fully connected dense layers with dropout rates of 0.5, 0.2, and 0. The ReLU function was applied to the outputs of the first two layers, whereas the softmax activation function was used on the third dense layer's output to generate a vector of 20 probability values, one prediction for each amino acid. Lu et al. [16] employed MutCompute to select 34 and 39 sites from WT PETase (PDB: 5XJH) and ThermoPETase (PDB: 6IJ6), respectively. A step-wise combination of these sites then generated 159 single and multiple mutations. Among them, four mutations (S121E, T140D, R224Q, and N233K) showed the most improvement and were used to generate all 29 possible combinations on three PETase scaffolds (WT PETase, ThermoPETase, and DuraPETase). Within the ThermoPETase scaffold, the best variant that emerged was termed FAST-PETase (functional, active, stable and tolerant PETase) and contains five mutations (D186H/R280A from the scaffold and N233K/R224Q/S121E from prediction). FAST-PETase outperforms both the WT and contemporaneously tailored variants in PET hydrolytic activity at temperatures between 30 and 50 °C, producing 33.8 mmol·L–1 of PET monomers in 96 h. More importantly, FAST-PETase can almost entirely break down untreated post-consumer PET from 51 diverse thermoformed items within a week. The same algorithm, MutCompute, was also applied by Meng et al. [17] to analyze the X-ray crystallographic structure of WT TfCut2 (Thermobifida fusca cutinase, PDB ID: 4CG1) to locate possible "instability hotspots". As a result, 44 residues were identified as "disfavored residues", which were further ranked by the extent of improvement between the WT residue and the putative substitution (log2(predicted/WT)). Based on experimental mutagenesis of the top ten ranked residues, the best variant, L32E/S113E/T237Q, was finally obtained. This variant demonstrated a 5.3-fold increase in crystalline PET powder hydrolysis and a half-inactivation temperature (T50^60) 5.7 °C greater than that of WT TfCut2. Lastly, based on Markov state model-based conformational dynamics analysis, a more effective binding mode offering improved PET accessibility to the active site was proposed to explain the enhanced performance of the triple variant.
Fig.2 Typical machine learning-based strategies for performance and stability enhancement of PET hydrolases. (a) Features of the local microenvironment of proteins in the PDB were extracted by feeding structural characteristics into a series of layers, and as an output, each amino acid was given a probability value (showing chemical congruency). After classification, the predicted mutations were experimentally validated and four mutations showed improvements. These four mutations were introduced into three PETase scaffolds, and with ThermoPETase as the scaffold, the most efficient variant, FAST-PETase, was obtained. Reprinted with permission from Ref. [16], copyright 2022, Springer Nature. (b) Promising mutations were predicted by the Transformer model after being trained on two different data sets. Mutations at PET binding site residues led to the H218S/F222I variant (M2). Further engineering of M2 by adding the remaining mutations at predicted sites decreased stability; therefore, the GRAPE strategy was used, resulting in the final mutant TurboPETase. Reprinted with permission from Ref. [18], copyright 2024, Springer Nature. (c) Three different machine learning algorithms (logistic regression, SVM and Random Forest) were employed on MD trajectories generated from ProTherm data to learn rules relating protein stability to structural features. The learned rules were employed to build a Random Forest-based model to predict thermal stability (Tm) changes, leading to the generation of the TfCutPSP mutant. Reprinted with permission from Ref. [15], copyright 2022, Research Network of Computational and Structural Biotechnology.
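
The hedged PyTorch sketch below mirrors the MutCompute-style architecture described above: two blocks of 3D convolutions each followed by max pooling (ReLU after every convolution), then three dense layers with dropout rates 0.5, 0.2 and 0, ending in a softmax over the 20 amino acids. The channel counts, the 20^3-voxel box size and the layer widths are illustrative assumptions, not the published MutCompute parameters.

```python
# Sketch of a MutCompute-like 3D CNN over a voxelized residue microenvironment.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MicroenvironmentCNN(nn.Module):
    def __init__(self, in_channels: int = 4, n_amino_acids: int = 20):
        super().__init__()
        # Feature extraction: 2 x (two 3D conv layers + one max-pooling layer).
        self.conv1a = nn.Conv3d(in_channels, 16, kernel_size=3, padding=1)
        self.conv1b = nn.Conv3d(16, 16, kernel_size=3, padding=1)
        self.conv2a = nn.Conv3d(16, 32, kernel_size=3, padding=1)
        self.conv2b = nn.Conv3d(32, 32, kernel_size=3, padding=1)
        self.pool = nn.MaxPool3d(2)
        # Classification: three dense layers with dropout rates 0.5, 0.2 and 0.
        self.fc1, self.drop1 = nn.Linear(32 * 5 * 5 * 5, 128), nn.Dropout(0.5)
        self.fc2, self.drop2 = nn.Linear(128, 64), nn.Dropout(0.2)
        self.fc3 = nn.Linear(64, n_amino_acids)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.pool(F.relu(self.conv1b(F.relu(self.conv1a(x)))))  # 20^3 -> 10^3
        x = self.pool(F.relu(self.conv2b(F.relu(self.conv2a(x)))))  # 10^3 -> 5^3
        x = torch.flatten(x, start_dim=1)
        x = self.drop1(F.relu(self.fc1(x)))
        x = self.drop2(F.relu(self.fc2(x)))
        return F.softmax(self.fc3(x), dim=1)  # 20 amino-acid probabilities

# One voxelized microenvironment: batch of 1, 4 feature channels, 20^3 box (assumed sizes).
probs = MicroenvironmentCNN()(torch.randn(1, 4, 20, 20, 20))
print(probs.shape)  # torch.Size([1, 20])
```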

Mono(2-hydroxyethyl) terephthalate (MHET), produced during PET depolymerization by PET hydrolases, in turn inhibits the enzymatic activity. MHETase is therefore crucial for complete PET depolymerization. However, currently available MHETases are inefficient and thermally unstable. Zhang et al. [93] therefore constructed a fused dual-enzyme system comprising KL-MHETase, a double variant of a thermophilic carboxylesterase, and FAST-PETase. KL-MHETase demonstrates a 67-fold increase in MHET-degrading activity and a Tm of 67.58 °C, resembling that of FAST-PETase (67.80 °C). The fused dual-enzyme system, developed with the aid of the structure prediction tool AlphaFold2, was able to depolymerize PET 2.6 times faster than FAST-PETase alone. Despite recent breakthroughs reaching 90% PET conversion, the remaining 10% is still nonbiodegradable owing to physical aging, which converts the amorphous portion into crystalline microstructures, restricting use in real-world industrial applications. In the latest investigation, Cui et al. [18] combined force-field-based algorithms and a protein language model (Fig.2(b)) to develop a hydrolase from bacterial strain HR29 that tackles this issue. A total of 15051 sequences associated with the Pfam cutinase family (PF01083) were gathered from the UniProt database, and sequences similar to BhrPETase and LCCICCG were sought from Uniclust30 and the Big Fantastic Database. The encoder comprised three transformer layers with eight attention heads and an embedding size of 512; the decoder outputs token probabilities according to the encoder embeddings. The algorithm was trained with a masked language modeling objective to predict the actual amino acid at the masked position. Models were trained for 20 epochs with a batch size of 32 using the Adam optimizer and a learning rate of 3e–4. The logits assigned to the WT amino acid were used to rank the residues, and from each model's prediction, the top ten residue sites with the highest overall scores were chosen. A total of 18 residue locations were selected, of which W104, H164, M166, W190, H191, H218 and F/I243 were proposed to lie within the PET binding region. The GRAPE strategy was then applied to the best mutant, M2 (BhrPETaseH218S/F222I), obtained through the Transformer model, yielding the mutant TurboPETase (BhrPETaseH218S/F222I/A209R/D238K/A251C/A281C/W104L/F243T). TurboPETase outperforms all previously reported PET hydrolases, achieving approximately 100% degradation of pretreated post-consumer PET waste. The complete breakdown of pretreated PET at substantial industrial loadings (up to 300 g·L–1) can be achieved in 10 h, with an optimal yield rate of 77.3 gTPAeq·L–1·h–1. According to structural analysis and kinetic parameters determined from the inverse Michaelis-Menten model, a more dynamic PET-binding cleft may provide access to more precise attack sites, leading to better depolymerization activity.
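
The hedged PyTorch sketch below illustrates the masked-language-model idea described above: a small transformer encoder (3 layers, 8 heads, embedding size 512) predicts the amino acid at a masked position, and the logit assigned to the wild-type residue can be used to rank positions. The tokenization, vocabulary, test sequence and omission of positional encodings are simplifying assumptions; this is not the authors' code.

```python
# Sketch of masked-language-model scoring of residue positions.
import torch
import torch.nn as nn

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
MASK_ID = len(AMINO_ACIDS)               # extra token id used as the mask
VOCAB = len(AMINO_ACIDS) + 1

class MaskedLM(nn.Module):
    def __init__(self, d_model: int = 512, n_heads: int = 8, n_layers: int = 3):
        super().__init__()
        # Positional encodings omitted for brevity in this sketch.
        self.embed = nn.Embedding(VOCAB, d_model)
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=n_heads,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.to_logits = nn.Linear(d_model, VOCAB)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        return self.to_logits(self.encoder(self.embed(tokens)))

def wt_logit(model: MaskedLM, seq: str, pos: int) -> float:
    """Mask one position and return the logit the model assigns to the WT residue."""
    tokens = torch.tensor([[AMINO_ACIDS.index(a) for a in seq]])
    wt_id = tokens[0, pos].item()
    tokens[0, pos] = MASK_ID
    with torch.no_grad():
        logits = model(tokens)           # shape: (1, seq_len, VOCAB)
    return logits[0, pos, wt_id].item()

model = MaskedLM()                       # in practice: trained with Adam, learning rate 3e-4
# Low WT logits flag positions where the model considers the WT residue disfavored.
print(wt_logit(model, "MKTAYIAKQR", pos=3))
```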

3.2 Leveraging machine learning for improving enzyme thermostability

PET hydrolases are expected to function at temperatures close to or above the glass transition temperature of PET. Previous studies have indicated that the optimal temperature (Topt) range for enzymatic PET hydrolysis is often between 50 and 70 °C, so enzyme thermostability is essential for industrial use [94,95]. The cutinase TfCut2 can degrade and upcycle PET, but its low thermal stability limits its potential. Most approaches to enhancing protein stability depend on crystal structures, ignoring the fact that true protein conformations are convoluted and dynamic. Li et al. [15] therefore combined molecular dynamics (MD) simulations and machine learning to develop the MDL method, which analyzes features gathered from MD simulations to predict the thermostability of protein variants (Fig.2(c)). About 86 3D structures of WT proteins were obtained from the PDB database, and 1293 single-point mutations in these proteins were identified from ProThermDB together with experimentally determined temperature parameters. Random Forest was selected as the algorithm for building the regression and classification models. Features such as ΔΔG, four amino acid network parameters, and hydrogen-bond/ionic-bond data were extracted from the MD simulations and used as inputs for the machine learning model, while the Pearson correlation coefficient (PCC) between experimentally recorded and predicted ΔTm values served as the performance benchmark. Prior to the MD simulations, PCC values based on the static protein structures were determined using FoldX [96]. To identify conserved domains and mutation hotspots, a total of 177 substitutions with predicted ΔTm > 1 °C were analyzed by BLAST. Nine of these variations—D174S, S121P, E202L, S113P, S113E, S163K, S194P, A55V, and D204P—were chosen for experimental validation. The best variant, S121P/D174S/D204P, exhibited the highest ΔTm value of 9.3 °C and a 46-fold increase in PET degradation ratio at 70 °C compared with WT TfCut2. The approach of Gupta and Agrawal [97] used machine learning to carry out in silico directed evolution of PETase, enabling the enzyme to break down PET more effectively by increasing its Topt. First, they trained three machine learning models—Random Forest, Logistic Regression, and Linear Regression—to predict Topt accurately. Afterwards, they performed machine learning-supervised directed evolution using Random Forest: the algorithm yielded hundreds of PETase mutants, which were filtered by Random Forest to find those with the highest Topt. A novel PETase mutant with a Topt of 71.38 °C was found after 1000 iterations, and a unique mutant with a Topt of 61.3 °C was obtained after 29 iterations. An external predictor was employed to estimate the Tm of these mutant enzymes to ensure their stability; through this, the 29-iteration mutant was found to have better thermostability than WT PETase. Researchers can enhance the effectiveness of other enzymes by implementing this methodology. Ding et al. [14] applied two approaches to rationally redesign LCCICCG. The first approach employed a hidden Markov model and obtained 4203 homologous protein sequences for LCCICCG, IsPETase, and DuraPETase from the UniParc database. The optimal activity temperature of all these sequences was then predicted using Preoptem (a deep learning model). After adopting an identity cutoff of ≥ 40%, all sequences with significant divergence from LCCICCG were eliminated, leaving 2834 protein sequences.
The remaining sequences were categorized into high-temperature (149 sequences) and low-temperature (2685 sequences) groups. Next, 18 mutations with high evolutionary potential to stabilize the protein at high temperatures were selected based on position-specific amino acid probability analysis. The second approach relied on co-evolutionary analysis and Preoptem prediction. Initially, 3441 LCCICCG-like sequences were extracted from the UniParc database. The possible impact of each point substitution on the evolutionary trajectory of its residue was then analyzed with the EVmutation function in the EVcouplings software. The top 5% of all predicted mutation sites, those with the highest evolutionary energy, were selected to form a virtual library of 106 mutations. The optimum temperature of the predicted substitutions was determined using a Preoptem regression model, and another 18 mutations were selected for experimental evaluation. Among these 36 mutations, only six point mutations (S32L, D18T, S98R, T157P, E173Q, and N213P) showed better performance. The LCCICCG_I6M mutant, combining these six single-point mutations, exhibited a 1.04 °C rise in Tm compared with WT LCCICCG. PET water bottle degradation by LCCICCG_I6M yielded 3.64 times more soluble products in 24 h than LCCICCG. A summary of machine learning-engineered PET hydrolases with enhanced performance is depicted in Fig.3.
Fig.3 Performances of PET hydrolases engineered by machine learning methods. The figure illustrates the fold increase in activity and the changes in Tm of the engineered PET hydrolases, sorted by their year of discovery. From left to right: TfCut2 (blue) was used as the WT to engineer ① TfCut2PSP (green) by MD simulations with the MDL method [15] and ② TfCut2EEQ (blue) by the MutCompute model [17]; ③ FAST-PETase (purple) was engineered from ThermoPETase (purple) by employing the MutCompute model [16]; ④ LCCICCG_I6M (red) was engineered from LCCICCG (orange) by Preoptem and evolutionary analysis [14]; ⑤ TurboPETase (green) was engineered from BhrPETase (red) by integrating a protein language model and force-field-based algorithms [18]. The color reflects the Tm value of each PET hydrolase.
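
As a minimal illustration of the regression step in an MDL-style pipeline as described in this subsection, the hedged Python sketch below trains a Random Forest on placeholder "MD-derived" descriptors to predict ΔTm and reports the Pearson correlation coefficient as the benchmark (scikit-learn assumed). The feature names and all values are invented; real inputs would be ΔΔG, residue-network parameters and hydrogen-bond/ionic-bond counts extracted from simulation trajectories.

```python
# Minimal sketch: Random Forest regression of delta-Tm from descriptor features, scored by PCC.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(3)
n_mutations = 300
X = rng.normal(size=(n_mutations, 6))   # placeholders for ddG, 4 network params, H-bond count
true_dtm = X @ rng.normal(size=6) + rng.normal(scale=0.5, size=n_mutations)

X_tr, X_te, y_tr, y_te = train_test_split(X, true_dtm, test_size=0.2, random_state=3)
model = RandomForestRegressor(n_estimators=200, random_state=3).fit(X_tr, y_tr)

pcc = np.corrcoef(y_te, model.predict(X_te))[0, 1]   # Pearson correlation coefficient
print(f"PCC on held-out mutations: {pcc:.2f}")
```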

3.3 Leveraging machine learning for the de novo design

Because PET hydrolases are currently scarce, protein design enables us to produce new enzymes without relying on the laborious and cumbersome process of enzyme mining. Recent breakthroughs have shown the applicability of machine learning methods for protein structure prediction and sequence generation, expanding our ability to design proteins de novo. In this regard, Ding et al. [98] extracted the catalytic triad of LCC and adopted an inpainting strategy to produce novel sequences that encode protein scaffolds supporting the catalytic triad and some nearby residues. Next, they computationally screened the potentially reasonable sequences based on AlphaFold2-predicted protein structures and molecular dynamics simulations. The filtered sequences were then assessed by expression and activity assays, and the ineffective designs were further revised by iterative inpainting or optimized with ProteinMPNN (Fig.4). Finally, three novel PET hydrolases were developed, and one of them, named RsPETase 1, was expressed effectively. Despite having a 30% shorter sequence and only 34% sequence similarity to the template LCC, RsPETase 1 has activity comparable to IsPETase and considerable thermostability (as indicated by its Tm of 56 °C). This strongly indicates that recent advances in computational and machine learning-based tools enable the design of new-to-nature enzymes with desired activity without relying on genome mining.
Fig.4 A computational workflow of protein scaffold remodeling for creating designer PET hydrolases. The workflow comprises four stages: in the scaffold remodeling stage, catalytic sites and some adjacent structures were extracted and the missing scaffold sequences were generated by inpainting. Then, in the computational screening stage, newly generated sequences were computationally analyzed based on protein microenvironment features. In the experimental validation stage, the sequences were expressed and evaluated based on activity and expression level. Designs exhibiting low activity and expression were then fed to the sequence refinement stage for improvement by employing machine learning-based strategies (RFjoint and ProteinMPNN). High-quality designs were obtained through iterative rounds of sequence refinement, computational screening, and experimental validation.
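
The schematic Python sketch below outlines the iterative design loop of Fig.4. Every function is a hypothetical placeholder standing in for a real tool (inpainting, AlphaFold2/MD screening, wet-lab assays, ProteinMPNN/RFjoint refinement); none of these stubs reproduce the actual interfaces of those tools, and the loop is only meant to show the screen-test-refine cycle.

```python
# Hypothetical, schematic design loop; all functions are placeholder stubs.
import random

def inpaint_scaffold(catalytic_motif: str) -> str:
    """Placeholder: generate a candidate sequence around the fixed catalytic motif."""
    return "".join(random.choice("ACDEFGHIKLMNPQRSTVWY") for _ in range(100)) + catalytic_motif

def passes_computational_screen(seq: str) -> bool:
    """Placeholder for AlphaFold2 confidence / MD-based microenvironment filters."""
    return random.random() > 0.5

def experimental_score(seq: str) -> float:
    """Placeholder for expression level and hydrolysis activity assays."""
    return random.random()

def refine_sequence(seq: str) -> str:
    """Placeholder for ProteinMPNN/RFjoint-style sequence refinement."""
    pos = random.randrange(len(seq))
    return seq[:pos] + random.choice("ACDEFGHIKLMNPQRSTVWY") + seq[pos + 1:]

catalytic_motif = "SDH"          # stand-in for the Ser-His-Asp triad geometry
candidate = inpaint_scaffold(catalytic_motif)
for round_idx in range(5):       # iterate: screen -> test -> refine
    if passes_computational_screen(candidate) and experimental_score(candidate) > 0.8:
        print(f"accepted design after round {round_idx}")
        break
    candidate = refine_sequence(candidate)
```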

4 Conclusions and future outlook

In recent years, the development of promising machine learning-based methods has become crucial for accelerating predictive computing and advancing enzyme engineering. This review updates readers on the recent progress in the machine learning-driven engineering of PET hydrolases for sustainable PET depolymerization and upcycling. Protein structure prediction is improving rapidly, but designing proteins with promising catalytic functionality is still challenging. Designer enzymes are inefficient and generally need to be further optimized by directed evolution. The reason for this poor catalysis might be that promising functionality necessitates a more precise arrangement of catalytic residues than current algorithms can deliver; inadequate treatment of protein dynamics might also contribute to this issue. It remains difficult to build robust and broadly applicable machine learning-driven models, limited by challenges such as the scarcity of large, high-quality databases, inadequate data processing procedures, the inability to handle multiple tasks in one machine learning pipeline, lack of model interpretability, and hidden data biases (negative experimental findings are rarely published). Furthermore, to promote the broad acceptance of machine learning tools, enzyme engineers must find them easier to use and more convenient to apply in practical settings. Currently, the criteria for assessing machine learning successes in protein engineering are unclear, as instances illustrating data-driven design are sparse and algorithm validation is confined to very few data sets. We anticipate that the impact of data-driven approaches on enzyme design and engineering will expand notably in the coming decades as more high-quality data and user-friendly models become publicly available.

References

[1]
Bescond A S, Pujari A. PET Polymer—Chemical Economics Handbook (IHS Markit). 2020
[2]
Carr C M , Clarke D J , Dobson A D W . Microbial polyethylene terephthalate hydrolases: current and future perspectives. Frontiers in Microbiology, 2020, 11: 571265
[3]
Wei R , von Haugwitz G , Pfaff L , Mican J , Badenhorst C P S , Liu W , Weber G , Austin H P , Bednar D , Damborsky J . . Mechanism-based design of efficient PET hydrolases. ACS Catalysis, 2022, 12(6): 3382–3396
[4]
Fang Y , Chao K , He J , Wang Z , Chen Z . High-efficiency depolymerization/degradation of polyethylene terephthalate plastic by a whole-cell biocatalyst. 3 Biotech, 2023, 13(5): 138
[5]
Ambrose-Dempster E , Leipold L , Dobrijevic D , Bawn M , Carter E M , Stojanovski G , Sheppard T D , Jeffries J W , Ward J M , Hailes H C . Mechanoenzymatic reactions for the hydrolysis of PET. RSC Advances, 2023, 13(15): 9954–9962
[6]
Cao F , Wang L , Zheng R , Guo L , Chen Y , Qian X . Research and progress of chemical depolymerization of waste PET and high-value application of its depolymerization products. RSC Advances, 2022, 12(49): 31564–31576
[7]
Lai J , Huang H , Lin M , Xu Y , Li X , Sun B . Enzyme catalyzes ester bond synthesis and hydrolysis: the key step for sustainable usage of plastics. Frontiers in Microbiology, 2023, 13: 1113705
[8]
Magalhães R P , Cunha J M , Sousa S F . Perspectives on the role of enzymatic biocatalysis for the degradation of plastic PET. International Journal of Molecular Sciences, 2021, 22(20): 11257
[9]
Akram E , Cao Y , Xing H , Ding Y , Luo Y , Wei R , Zhang Y . On the temperature dependence of enzymatic degradation of poly(ethylene terephthalate). Chinese Journal of Catalysis, 2024, 60: 284–293
[10]
Müller R J , Schrader H , Profe J , Dresler K , Deckwer W D . Enzymatic degradation of poly(ethylene terephthalate): rapid hydrolyse using a hydrolase from T. fusca. Macromolecular Rapid Communications, 2005, 26(17): 1400–1405
[11]
Sulaiman S , Yamato S , Kanaya E , Kim J J , Koga Y , Takano K , Kanaya S . Isolation of a novel cutinase homolog with polyethylene terephthalate-degrading activity from leaf-branch compost by using a metagenomic approach. Applied and Environmental Microbiology, 2012, 78(5): 1556–1562
[12]
Yoshida S , Hiraga K , Takehana T , Taniguchi I , Yamaji H , Maeda Y , Toyohara K , Miyamoto K , Kimura Y , Oda K . A bacterium that degrades and assimilates poly(ethylene terephthalate). Science, 2016, 351(6278): 1196–1199
[13]
Cui Y , Chen Y , Liu X , Dong S , Tian Y E , Qiao Y , Mitra R , Han J , Li C , Han X . . Computational redesign of a PETase for plastic biodegradation under ambient condition by the grape strategy. ACS Catalysis, 2021, 11(3): 1340–1350
[14]
Ding Z , Xu G , Miao R , Wu N , Zhang W , Yao B , Guan F , Huang H , Tian J . Rational redesign of thermophilic PET hydrolase LCCICCG to enhance hydrolysis of high crystallinity polyethylene terephthalates. Journal of Hazardous Materials, 2023, 453: 131386
[15]
Li Q , Zheng Y , Su T , Wang Q , Liang Q , Zhang Z , Qi Q , Tian J . Computational design of a cutinase for plastic biodegradation by mining molecular dynamics simulations trajectories. Computational and Structural Biotechnology Journal, 2022, 20: 459–470
[16]
Lu H , Diaz D J , Czarnecki N J , Zhu C , Kim W , Shroff R , Acosta D J , Alexander B R , Cole H O , Zhang Y . . Machine learning-aided engineering of hydrolases for PET depolymerization. Nature, 2022, 604(7907): 662–667
[17]
Meng S , Li Z , Zhang P , Contreras F , Ji Y , Schwaneberg U . Deep learning guided enzyme engineering of Thermobifida fusca cutinase for increased PET depolymerization. Chinese Journal of Catalysis, 2023, 50: 229–238
[18]
Cui Y , Chen Y , Sun J , Zhu T , Pang H , Li C , Geng W C , Wu B . Computational redesign of a hydrolase for nearly complete PET depolymerization at industrially relevant high-solids loading. Nature Communications, 2024, 15(1): 1417
[19]
Bell E L , Smithson R , Kilbride S , Foster J , Hardy F J , Ramachandran S , Tedstone A A , Haigh S J , Garforth A A , Day P J . . Directed evolution of an efficient and thermostable PET depolymerase. Nature Catalysis, 2022, 5(8): 673–681
[20]
Liu F , Wang T , Yang W , Zhang Y , Gong Y , Fan X , Wang G , Lu Z , Wang J . Current advances in the structural biology and molecular engineering of PETase. Frontiers in Bioengineering and Biotechnology, 2023, 11: 1263996
[21]
Son H F , Cho I J , Joo S , Seo H , Sagong H Y , Choi S Y , Lee S Y , Kim K J . Rational protein engineering of thermo-stable PETase from Ideonella sakaiensis for highly efficient PET degradation. ACS Catalysis, 2019, 9(4): 3519–3526
[22]
Zurier H S , Goddard J M . A high-throughput expression and screening platform for applications-driven PETase engineering. Biotechnology and Bioengineering, 2023, 120(4): 1000–1014
[23]
Tournier V , Topham C , Gilles A , David B , Folgoas C , Moya Leclair E , Kamionka E , Desrousseaux M L , Texier H , Gavalda S . . An engineered PET depolymerase to break down and recycle plastic bottles. Nature, 2020, 580(7802): 216–219
[24]
Thiyagarajan S , Maaskant-Reilink E , Ewing T A , Julsing M K , Van Haveren J . Back-to-monomer recycling of polycondensation polymers: opportunities for chemicals and enzymes. RSC Advances, 2022, 12(2): 947–970
[25]
Yang K K, Wu Z, Arnold F H. Machine learning in protein engineering. Preprint arXiv: 1811.10775, 2018
[26]
Mazurenko S , Prokop Z , Damborsky J . Machine learning in enzyme engineering. ACS Catalysis, 2020, 10(2): 1210–1223
[27]
Chang C , Deringer V L , Katti K S , Van Speybroeck V , Wolverton C M . Simulations in the era of exascale computing. Nature Reviews. Materials, 2023, 8(5): 309–313
[28]
Pyzer-Knapp E O , Pitera J W , Staar P W , Takeda S , Laino T , Sanders D P , Sexton J , Smith J R , Curioni A . Accelerating materials discovery using artificial intelligence, high performance computing and robotics. npj Computational Materials, 2022, 8(1): 84
[29]
Singh V , Patra S , Murugan N A , Toncu D C , Tiwari A . Recent trends in computational tools and data-driven modeling for advanced materials. Materials Advances, 2022, 3(10): 4069–4087
[30]
Beller M , Bender M , Bornscheuer U T , Schunk S . Catalysis—far from being a mature technology. Chemie Ingenieur Technik, 2022, 94(11): 1559–1559
[31]
Greener J G , Kandathil S M , Moffat L , Jones D T . A guide to machine learning for biologists. Nature Reviews. Molecular Cell Biology, 2022, 23(1): 40–55
[32]
Feehan R , Montezano D , Slusky J S . Machine learning for enzyme engineering, selection and design. Protein Engineering, Design & Selection, 2021, 34: gzab019
[33]
Markus B , C G C , Andreas K , Arkadij K , Stefan L , Gustav O , Elina S , Radka S . Accelerating biocatalysis discovery with machine learning: a paradigm shift in enzyme engineering, discovery, and design. ACS Catalysis, 2023, 13(21): 14454–14469
[34]
Sampaio P S , Fernandes P . Machine learning: a suitable method for biocatalysis. Catalysts, 2023, 13(6): 961
[35]
Chapelle O, Chi M, Zien A. A continuation method for semi-supervised SVMs. In: Proceedings of the 23rd International Conference on Machine Learning. New York: ACM Press, 2006, 185–192
[36]
Kouba P , Kohout P , Haddadi F , Bushuiev A , Samusevich R , Sedlar J , Damborsky J , Pluskal T , Sivic J , Mazurenko S . Machine learning-guided protein engineering. ACS Catalysis, 2023, 13(21): 13863–13895
[37]
Schomburg I , Chang A , Schomburg D . BRENDA, enzyme data and metabolic information. Nucleic Acids Research, 2002, 30(1): 47–49
[38]
Berman H M , Westbrook J , Feng Z , Gilliland G , Bhat T N , Weissig H , Shindyalov I N , Bourne P E . The Protein Data Bank. Nucleic Acids Research, 2000, 28(1): 235–242
[39]
Yan B , Ran X , Gollu A , Cheng Z , Zhou X , Chen Y , Yang Z J . IntEnzyDB: an integrated structure-kinetics enzymology database. Journal of Chemical Information and Modeling, 2022, 62(22): 5841–5848
[40]
The UniProt Consortium. UniProt: a worldwide hub of protein knowledge. Nucleic Acids Research, 2019, 47(D1): D506–D515
[41]
The UniProt Consortium. UniProt: the universal protein knowledgebase in 2021. Nucleic Acids Research, 2021, 49(D1): D480–D489
[42]
Pleiss J . Standardized data, scalable documentation, sustainable storage-EnzymeML as a basis for FAIR data management in biocatalysis. ChemCatChem, 2021, 13(18): 3909–3913
[43]
Velecký J , Hamsikova M , Stourac J , Musil M , Damborsky J , Bednar D , Mazurenko S . SoluProtMutDB: a manually curated database of protein solubility changes upon mutations. Computational and Structural Biotechnology Journal, 2022, 20: 6339–6347
[44]
Xavier J S , Nguyen T B , Karmarkar M , Portelli S , Rezende P M , Velloso J P , Ascher D B , Pires D E . ThermoMutDB: a thermodynamic database for missense mutations. Nucleic Acids Research, 2021, 49(D1): D475–D479
[45]
Stourac J , Dubrava J , Musil M , Horackova J , Damborsky J , Mazurenko S , Bednar D . FireProtDB: database of manually curated protein stability data. Nucleic Acids Research, 2021, 49(D1): D319–D324
[46]
Nikam R , Kulandaisamy A , Harini K , Sharma D , Gromiha M M . ProThermDB: thermodynamic database for proteins and mutants revisited after 15 years. Nucleic Acids Research, 2021, 49(D1): D420–D424
[47]
Heid E , Probst D , Green W H , Madsen G K . EnzymeMap: curation, validation and data-driven prediction of enzymatic reactions. Chemical Science, 2023, 48(14): 14229–14242
[48]
Probst D , Manica M , Nana Teukam Y G , Castrogiovanni A , Paratore F , Laino T . Biocatalysed synthesis planning using data-driven learning. Nature Communications, 2022, 13(1): 964
[49]
Ganter M , Bernard T , Moretti S , Stelling J , Pagni M . MetaNetX. org: a website and repository for accessing, analysing and manipulating metabolic networks. Bioinformatics, 2013, 29(6): 815–816
[50]
Hafner J , MohammadiPeyhani H , Sveshnikova A , Scheidegger A , Hatzimanikatis V . Updated atlas of biochemistry with new metabolites and improved enzyme prediction power. ACS Synthetic Biology, 2020, 9(6): 1479–1482
[51]
Wishart D S , Li C , Marcu A , Badran H , Pon A , Budinski Z , Patron J , Lipton D , Cao X , Oler E . . PathBank: a comprehensive pathway database for model organisms. Nucleic Acids Research, 2020, 48(D1): D470–D478
[52]
Wittig U , Rey M , Weidemann A , Kania R , Müller W . SABIO-RK: an updated resource for manually curated biochemical reaction kinetics. Nucleic Acids Research, 2018, 46(D1): D656–D660
[53]
Afify H M , Abdelhalim M B , Mabrouk M S , Sayed A Y . Protein secondary structure prediction (PSSP) using different machine algorithms. Egyptian Journal of Medical Human Genetics, 2021, 22(1): 1–10
[54]
Liu B , Wang X , Lin L , Tang B , Dong Q , Wang X . Prediction of protein binding sites in protein structures using hidden Markov support vector machine. BMC Bioinformatics, 2009, 10(1): 1–14
[55]
Palla M , Punthambaker S , Stranges B , Vigneault F , Nivala J , Wiegand D , Ayer A , Craig T , Gremyachinskiy D , Franklin H . . Multiplex single-molecule kinetics of nanopore-coupled polymerases. ACS Nano, 2021, 15(1): 489–502
[56]
Fang X , Huang J , Zhang R , Wang F , Zhang Q , Li G , Yan J , Zhang H , Yan Y , Xu L . Convolution neural network-based prediction of protein thermostability. Journal of Chemical Information and Modeling, 2019, 59(11): 4833–4843
[57]
Gelman S , Fahlberg S A , Heinzelman P , Romero P A , Gitter A . Neural networks to learn protein sequence-function relationships from deep mutational scanning data. Proceedings of the National Academy of Sciences of the United States of America, 2021, 118(48): e2104878118
[58]
Mellor J , Grigoras I , Carbonell P , Faulon J L . Semisupervised gaussian process for automated enzyme search. ACS Synthetic Biology, 2016, 5(6): 518–528
[59]
Pires D E , Ascher D B , Blundell T L . mCSM: predicting the effects of mutations in proteins using graph-based signatures. Bioinformatics, 2014, 30(3): 335–342
[60]
Hakala K , Kaewphan S , Björne J , Mehryary F , Moen H , Tolvanen M , Salakoski T , Ginter F . Neural network and random forest models in protein function prediction. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 2022, 19(3): 1772–1781
[61]
Kathuria C , Mehrotra D , Misra N K . Predicting the protein structure using random forest approach. Procedia Computer Science, 2018, 132: 1654–1662
[62]
Wang C , Chen Y , Zhang Y , Li K , Lin M , Pan F , Wu W , Zhang J . A reinforcement learning approach for protein-ligand binding pose prediction. BMC Bioinformatics, 2022, 23(1): 1–18
[63]
Ryu J Y , Kim H U , Lee S Y . Deep learning enables high-quality and high-throughput prediction of enzyme commission numbers. Proceedings of the National Academy of Sciences of the United States of America, 2019, 116(28): 13996–14001
[64]
Dalkiran A , Rifaioglu A S , Martin M J , Cetin A R , Atalay V , Doğan T . ECPred: a tool for the prediction of the enzymatic functions of protein sequences based on the EC nomenclature. BMC Bioinformatics, 2018, 19(1): 1–13
[65]
Zou Z , Tian S , Gao X , Li Y . mlDEEPre: multi-functional enzyme function prediction with hierarchical multi-label deep learning. Frontiers in Genetics, 2019, 9: 714
[66]
Cadet F , Fontaine N , Li G , Sanchis J , Ng F C M , Pandjaitan R , Vetrivel I , Offmann B , Reetz M T . A machine learning approach for reliable prediction of amino acid interactions and its application in the directed evolution of enantioselective enzymes. Scientific Reports, 2018, 8(1): 16757
[67]
Robinson S L , Smith M D , Richman J E , Aukema K G , Wackett L P . Machine learning-based prediction of activity and substrate specificity for OleA enzymes in the thiolase superfamily. Synthetic Biology, 2020, 5(1): ysaa004
[68]
Robinson S L , Terlouw B R , Smith M D , Pidot S J , Stinear T P , Medema M H , Wackett L P . Global analysis of adenylate-forming enzymes reveals β-lactone biosynthesis pathway in pathogenic nocardia. Journal of Biological Chemistry, 2020, 295(44): 14826–14839
[69]
Song J , Li F , Takemoto K , Haffari G , Akutsu T , Chou K C , Webb G I . PREvaIL, an integrative approach for inferring catalytic residues using sequence, structural, and network features in a machine-learning framework. Journal of Theoretical Biology, 2018, 443: 125–137
[70]
Torng W , Altman R B . High precision protein functional site detection using 3D convolutional neural networks. Bioinformatics, 2019, 35(9): 1503–1512
[71]
Somarowthu S , Yang H , Hildebrand D G , Ondrechen M J . High-performance prediction of functional residues in proteins with machine learning and computed input features. Biopolymers, 2011, 95(6): 390–400
[72]
Li G , Rabe K S , Nielsen J , Engqvist M K . Machine learning applied to predicting microorganism growth temperatures and enzyme catalytic optima. ACS Synthetic Biology, 2019, 8(6): 1411–1420
[73]
Foroozandeh S M , Farhadyar K , Kavousi K , Azarabad M H , Boroomand A , Ariaeenejad S , Hosseini S G . A generalized machine-learning aided method for targeted identification of industrial enzymes from metagenome: a xylanase temperature dependence case study. Biotechnology and Bioengineering, 2021, 118(2): 759–769
[74]
Li F , Yuan L , Lu H , Li G , Chen Y , Engqvist M K , Kerkhoven E J , Nielsen J . Deep learning-based kcat prediction enables improved enzyme-constrained model reconstruction. Nature Catalysis, 2022, 5(8): 662–672
[75]
Xie W J , Asadi M , Warshel A . Enhancing computational enzyme design by a maximum entropy strategy. Proceedings of the National Academy of Sciences of the United States of America, 2022, 119(7): e2122355119
[76]
Shroff R , Cole A W , Diaz D J , Morrow B R , Donnell I , Annapareddy A , Gollihar J , Ellington A D , Thyer R . Discovery of novel gain-of-function mutations guided by structure-based deep learning. ACS Synthetic Biology, 2020, 9(11): 2927–2935
[77]
Ostafe R , Fontaine N , Frank D , Ng F C M , Prodanovic R , Pandjaitan R , Offmann B , Cadet F , Fischer R . One-shot optimization of multiple enzyme parameters: tailoring glucose oxidase for pH and electron mediators. Biotechnology and Bioengineering, 2020, 117(1): 17–29
[78]
Høie M H , Cagiada M , Frederiksen A H B , Stein A , Lindorff Larsen K . Predicting and interpreting large-scale mutagenesis data using analyses of protein stability and conservation. Cell Reports, 2022, 38(2): 110207
[79]
Chew A K , Jiang S , Zhang W , Zavala V M , Van Lehn R C . Fast predictions of liquid-phase acid-catalyzed reaction rates using molecular dynamics simulations and convolutional neural networks. Chemical Science, 2020, 11(46): 12464–12476
[80]
Ran X, Jiang Y, Shao Q, Yang Z J. EnzyKR: a chirality-aware deep learning model for predicting the outcomes of the hydrolase-catalyzed kinetic resolution. Chemical Science, 2023, 14(43): 12073–12082
[81]
Wang S , Tang H , Zhao Y , Zuo L . BayeStab: predicting effects of mutations on protein stability with uncertainty quantification. Protein Science, 2022, 31(11): e4467
[82]
Iqbal S , Ge F , Li F , Akutsu T , Zheng Y , Gasser R B , Yu D J , Webb G I , Song J . PROST: Alphafold2-aware sequence-based predictor to estimate protein stability changes upon missense mutations. Journal of Chemical Information and Modeling, 2022, 62(17): 4270–4282
[83]
Hernández I M , Dehouck Y , Bastolla U , López-Blanco J R , Chacón P . Predicting protein stability changes upon mutation using a simple orientational potential. Bioinformatics, 2023, 39(1): btad011
[84]
Pak M A , Markhieva K A , Novikova M S , Petrov D S , Vorobyev I S , Maksimova E S , Kondrashov F A , Ivankov D N . Using Alphafold to predict the impact of single mutations on protein stability and function. PLoS One, 2023, 18(3): e0282689
[85]
Gado J E , Beckham G T , Payne C M . Improving enzyme optimum temperature prediction with resampling strategies and ensemble learning. Journal of Chemical Information and Modeling, 2020, 60(8): 4098–4107
[86]
Yang Y , Zeng L , Vihinen M . PON-Sol2: prediction of effects of variants on protein solubility. International Journal of Molecular Sciences, 2021, 22(15): 8027
[87]
Russ W P , Figliuzzi M , Stocker C , Barrat-Charlaix P , Socolich M , Kast P , Hilvert D , Monasson R , Cocco S , Weigt M . An evolution-based model for designing chorismate mutase enzymes. Science, 2020, 369(6502): 440–445
[88]
Mak W S , Wang X , Arenas R , Cui Y , Bertolani S , Deng W Q , Tagkopoulos I , Wilson D K , Siegel J B . Discovery, design, and structural characterization of alkane-producing enzymes across the ferritin-like superfamily. Biochemistry, 2020, 59(40): 3834–3843
[89]
Dauparas J , Anishchenko I , Bennett N , Bai H , Ragotte R J , Milles L F , Wicky B I , Courbet A , de Haas R J , Bethel N . . Robust deep learning-based protein sequence design using ProteinMPNN. Science, 2022, 378(6615): 49–56
[90]
Watson J L , Juergens D , Bennett N R , Trippe B L , Yim J , Eisenach H E , Ahern W , Borst A J , Ragotte R J , Milles L F . . De novo design of protein structure and function with RFdiffusion. Nature, 2023, 620(7976): 1089–1100
[91]
Wu K E , Yang K K , van den Berg R , Alamdari S , Zou J Y , Lu A X , Amini A P . Protein structure generation via folding diffusion. Nature Communications, 2024, 15(1): 1059
[92]
Zhang Z, Xu M, Jamasb A, Chenthamarakshan V, Lozano A, Das P, Tang J. Protein representation learning by geometric structure pretraining. Preprint arXiv: 2203.06125, 2022
[93]
Zhang J , Wang H , Luo Z , Yang Z , Zhang Z , Wang P , Li M , Zhang Y , Feng Y , Lu D . . Computational design of highly efficient thermostable MHET hydrolases and dual enzyme system for PET recycling. Communications Biology, 2023, 6(1): 1135
[94]
Xu A , Zhou J , Blank L M , Jiang M . Future focuses of enzymatic plastic degradation. Trends in Microbiology, 2023, 31(7): 668–671
[95]
Zhang Y . A relay for improving the catalytic efficiency and thermostability of PET hydrolases. Chem Catalysis, 2022, 2(10): 2420–2422
[96]
Schymkowitz J, Borg J, Stricher F, Nys R, Rousseau F, Serrano L. The FoldX web server: an online force field. Nucleic Acids Research, 2005, 33: W382–W388
[97]
Gupta A, Agrawal S. Machine learning-based enzyme engineering of PETase for improved efficiency in plastic degradation. Journal of Emerging Investigators, 2023, 6: doi:10.59720/22-016
[98]
Ding Y, Zhang S, Hess H, Kong X, Zhang Y. Replicating enzymatic activity by positioning active sites with synthetic protein scaffolds. Preprint bioRxiv: 2024.01.31.577620, 2024

Competing interests

The authors declare that they have no competing interests.

Acknowledgements

This work was supported by the National Natural Science Foundation of China under grant number 32371325, the Seed Funding of China Petrochemical Corporation (Sinopec Group) under grant number 223260, and the Fundamental Research Funds for the Central Universities (QNTD2023-01).

RIGHTS & PERMISSIONS

2024 Higher Education Press

Part of a collection:

Catalysis for a sustainable future
