1 Introduction
Synthetic binding proteins (SBPs) with small size, marked solubility and stability, and high binding affinity are essential for protein-based research, treatment, and diagnostics [
1]. SBPs were usually designed by site-directed mutagenesis and directed evolution techniques based on a privileged protein scaffold [
2]. Over the past several decades, the number of SBPs has significantly increased. According to SYNBIP [
2], a comprehensive database of SBP-related information, over 2,000 SBPs have been discovered and 74 of them have entered the clinic [
2]. However, the classical techniques have limitations in expanding the protein space of SBPs [
3]. Site-directed mutagenesis depends heavily on the physiological properties and structure of the parental protein, which limits the transfer of mutation results obtained from a single protein to other proteins [
3], and it may also be hampered by the lack of high-resolution crystal structures [
4]. Directed evolution, in turn, explores only the regions of sequence space around natural proteins: it modifies only certain amino acids in the protein sequence, and the quality of the generated libraries is not guaranteed [
5,
6]. Protein
de novo design uses physicochemical principles and molecular interactions to generate new protein variants, allowing complete protein remodeling without stringent requirements on the parent proteins. The positions that can be modified are also far less restricted [
3]. Meanwhile, protein
de novo design software based on physical methods relies on energy functions to predict the stability and folding behavior [
3,
7-
12]. However, current energy functions may not be sufficient to accurately simulate the complex folding and interactions of proteins in nature [
13], which has led to design failures. Deep learning (DL) has significant advantages in extracting features and integrating information from proteins [
14], so it can uncover subtle patterns and correlations that traditional methods might overlook. Therefore, exploring synthetic binding protein design using DL-based
de novo protein design is urgent and of great significance [
15].
DL has been applied to multiple fields, including protein design. Models such as ProteinSolver [
16] and ProteinBERT [
17] enable fast and flexible design based on real protein sequences and structures. In addition, deep networks that predict the structure of natural proteins are being inverted to design entirely new proteins [
18] and to create entirely new luciferase enzymes by way of ‘family-wide hallucination’ [
19]. The advantage of
de novo protein design based on DL is that the model learns from a large body of existing protein structures which sequences are plausible [
20-
22], so the generated sequences are themselves more likely to be plausible [
20].
More recently, DL-based
de novo protein design has seen a breakthrough with the release of ProteinMPNN [
23]. As an advanced DL framework, ProteinMPNN [
23] achieves an unprecedented average sequence recovery of 52.4% and designs a new 100-residue protein in 1.2 s on a single CPU [
23]. For ProteinMPNN [
23], 23,358 training clusters (more than 500,000 crystal structures) were selected from the Protein Data Bank (PDB) [
24]. Compared to other DL-based protein design methods, ProteinMPNN not only uses more training data but also designs protein sequences quickly, with a high success rate and a wide range of applications. However, a thorough study of SYNBIP [
2] revealed that only a very small number (245) of SBPs have crystal structures, and it is reasonable to speculate that few SBP crystal structures were used for ProteinMPNN training. Thus, to expand the number of SBPs with DL-based protein sequence design methods like ProteinMPNN [
25], it is necessary to extensively explore their applicability to SBPs with different scaffolds [
26].
In this work, starting from the 3D structures of SBPs collected in the SYNBIP database [
2], ProteinMPNN [
23] was first used for
de novo design of synthetic protein sequences. Comprehensive bioinformatics analysis then suggested that the new protein sequences have improved solubility and stability compared with the original SBPs. Furthermore, two forms of input structure were compared: the SBP alone (monomer) and the SBP bound to its protein target (complex). The sequences generated from monomer structures were better than those generated from complex structures in terms of solubility and stability, whereas the sequences designed from complex structures performed better in calculated binding energy. Synthetic proteins designed by ProteinMPNN also compared favorably with those obtained by classical protein engineering methods. Overall, the DL-based protein sequence design framework ProteinMPNN is shown to be suitable for generating candidate synthetic protein sequences. It shows great potential to overcome the limitations of protein engineering through the rapid generation of high-quality protein libraries, which provides scientists with new ideas for expanding the sequence space of synthetic binding proteins.
2 Materials and methods
2.1 Input structures preparation
A total of 1,327 monomer structures, comprising 1,100 structures predicted by trRosetta [
27] and 227 experimental structures, were retrieved from the SYNBIP database [
2]. The selected proteins are all single-chain SBPs with available structures from the SYNBIP database. The number of SBPs is consistent with the SYNBIP database. The reliability of predicted structures is discussed in the “Reliability of predicted structures” section of Supplementary Materials.
A total of 61 complex structures were prepared, including 1 crystal structure [
28] and 60 computational models [
29]. In the preliminary stage, at least one representative SBP was selected for each protein scaffold. These SBPs were docked with their respective targets, resulting in 60 successfully docked complex structures. To facilitate comparative analysis across different sources of complex structures, one additional crystal complex structure was chosen from among the successfully docked complexes. Meanwhile, to control variables, an additional 61 monomer structures were extracted from the complex structures and used for synthetic protein design in the comparative analysis.
2.2 Generation of synthetic proteins using ProteinMPNN
Based on the prepared PDB files of 61 complex structures and 1,327 monomer structures, the DL-based framework ProteinMPNN (see github.com/dauparas/ProteinMPNN/) [
23] was used to generate new synthetic proteins. The training set of ProteinMPNN consists of protein assemblies in the PDB (as of August 2, 2021) determined by X-ray crystallography or cryo-EM, with a resolution better than 3.5 Å and fewer than 10,000 residues. Among them, the structures of 216 SBPs (95%) are included in the training set. The protein data of the training set can be downloaded from files.ipd.uw.edu/pub/training_sets/pdb_2021aug02.tar.gz/. SBPs in the monomer group were designed at different temperatures, and sequence recovery decreased as the temperature increased (Fig.1). As sequence recovery is a crucial evaluation criterion for
de novo protein design [
30], the temperature parameter of ProteinMPNN was set to 0.1 in this work. For each SBP, five amino acid sequences were generated.
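The two settings above (temperature 0.1, five sequences per SBP) map onto command-line flags of the `protein_mpnn_run.py` script in the ProteinMPNN repository. The following is a minimal sketch rather than the exact invocation used in this work: the input and output paths are hypothetical, and the script is assumed to be in the current working directory.

```python
def build_mpnn_command(pdb_path, out_folder, num_seqs=5, temperature=0.1):
    """Assemble an argv list for ProteinMPNN's protein_mpnn_run.py.

    pdb_path and out_folder are placeholders chosen for illustration;
    the defaults mirror the settings reported in Section 2.2.
    """
    return [
        "python", "protein_mpnn_run.py",
        "--pdb_path", str(pdb_path),
        "--out_folder", str(out_folder),
        "--num_seq_per_target", str(num_seqs),  # sequences per input SBP
        "--sampling_temp", str(temperature),    # sampling temperature
    ]


# Hypothetical file names, for illustration only.
cmd = build_mpnn_command("inputs/example_sbp.pdb", "outputs/example_sbp")
print(" ".join(cmd))
```

The returned list can be passed to `subprocess.run(cmd, check=True)` once the repository and its dependencies are installed.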
2.3 Analysis of sequence similarity to proteins in the Protein Data Bank
The sequence similarity between the newly generated synthetic proteins and the proteins stored in the PDB was analyzed with an in-house Python script based on the MMseqs2 algorithm [
31]. The selection of sequence similarity threshold depends on the specific task. Generally, sequences with a similarity of less than 50% are considered non-homologous [
32]. A stricter threshold of 30% was chosen in this work so that sequences below it can be regarded as novel and non-homologous with greater confidence. MMseqs2 [
31] was employed to search for similar protein sequences. For sequence similarity search, MMseqs2 is considerably faster than common methods like BLAST [
33] at comparable sensitivity.
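The thresholding step can be sketched as follows. This is an illustrative sketch and not the in-house script: it assumes that the reported similarity is the percent sequence identity of the best PDB hit, that search results have already been reduced to one best value per designed sequence, and that sequences without any hit are simply absent from that mapping. All identifiers are hypothetical.

```python
def classify_by_best_identity(all_ids, best_identity, threshold=30.0):
    """Group designed sequences by their best hit against the PDB.

    all_ids:       IDs of every designed sequence that was searched.
    best_identity: dict mapping ID -> highest percent identity (0-100)
                   among its hits; IDs without any hit are absent.
    """
    groups = {"no_hit": [], "at_or_below": [], "above": []}
    for seq_id in all_ids:
        if seq_id not in best_identity:
            groups["no_hit"].append(seq_id)
        elif best_identity[seq_id] <= threshold:
            groups["at_or_below"].append(seq_id)
        else:
            groups["above"].append(seq_id)
    return groups


# Toy data with hypothetical IDs and identities.
ids = ["design_1", "design_2", "design_3", "design_4"]
hits = {"design_2": 27.5, "design_3": 30.0, "design_4": 64.2}
groups = classify_by_best_identity(ids, hits)
print({name: len(members) for name, members in groups.items()})
```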
2.4 Property analysis of the designed synthetic proteins
2.4.1 Solubility analysis
Protein-Sol [
34] was used for solubility calculation. The solubility predicted by Protein-Sol is a number from 0 to 1. A value higher than 0.45 indicates that the protein is predicted to be more soluble than the average soluble
E. coli protein [
35]; that is, higher values mean better solubility. Of the 6,635 amino acid sequences generated from SBPs in the monomer state, 65 were shorter than 20 residues and 23 were identified as membrane proteins. For consistent statistics, these 88 amino acid sequences, together with the other sequences generated from the same SBPs, were excluded. Because Protein-Sol is a web server, a Python program was used to automate the calculation in this work.
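The exclusion rule described above removes a parent SBP entirely as soon as one of its designs is too short or is flagged as a membrane protein. A minimal sketch of that rule is given below; the data layout, the function name, and the way membrane proteins are flagged are assumptions made for illustration.

```python
def drop_incomplete_groups(designs, membrane_seqs=frozenset(), min_length=20):
    """Keep only parent SBPs whose designs all pass both filters.

    designs:       dict mapping parent SBP ID -> list of designed sequences.
    membrane_seqs: sequences flagged as membrane proteins by some
                   external predictor (not implemented here).
    """
    kept = {}
    for sbp_id, seqs in designs.items():
        rejected = any(len(s) < min_length or s in membrane_seqs for s in seqs)
        if not rejected:
            kept[sbp_id] = seqs
    return kept


# Toy input with hypothetical IDs and sequences.
designs = {
    "SBP_A": ["M" + "A" * 24, "M" + "K" * 24],
    "SBP_B": ["M" + "A" * 24, "MKT"],            # one design is too short
    "SBP_C": ["M" + "L" * 24, "M" + "E" * 24],   # one design is flagged below
}
kept = drop_incomplete_groups(designs, membrane_seqs={"M" + "L" * 24})
print(sorted(kept))
```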
2.4.2 Stability analysis
The ProtParam method [
36] integrated in TBtools [
37] was used for thermal stability calculation. The instability index is widely used to evaluate the instability of amino acid sequences
in vivo. A protein with an instability index less than or equal to 40 is predicted to be stable, while a protein with a value higher than 40 may be unstable. In other words, the smaller the value, the higher the stability.
2.5 Analysis of binding energy between synthetic proteins and targets
To analyze the binding energy between synthetic proteins and targets, the newly generated proteins were first docked to the binding site of the protein target using PyMOL [
38]. Then, PDBePISA [
39] was used for binding energy estimation. Of the five designs for each SBP, the one with the highest binding energy was selected for comparison.
2.6 Statistical analysis
Data analysis was performed using SPSS Statistics 29.0 software. To compare the solubility, stability, and binding energy between different design methods for the same SBP and to determine whether the differences were significant, a paired t-test was conducted under the assumption that the sample differences followed a normal distribution; the Supplementary Materials confirm that this assumption holds. The null hypothesis was that the mean difference equals zero, and P < 0.05 was taken to denote statistical significance. Specific calculation data can be found in the Supplementary Materials.
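For readers who want to reproduce the test outside SPSS, the paired t statistic can be sketched with the Python standard library alone. This is an illustration under stated assumptions, not the analysis pipeline used here: the input values are made up, and the p-value is left out because it requires the t distribution (available, for example, through `scipy.stats.ttest_rel`).

```python
import math
import statistics


def paired_t_statistic(a, b):
    """Return (t, degrees of freedom) for paired samples a and b.

    t = mean(d) / (sd(d) / sqrt(n)), where d are the pairwise differences.
    """
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two equally long samples with n >= 2")
    diffs = [x - y for x, y in zip(a, b)]
    sd = statistics.stdev(diffs)  # sample standard deviation (n - 1)
    if sd == 0:
        raise ValueError("all differences are identical; t is undefined")
    n = len(diffs)
    t = statistics.mean(diffs) / (sd / math.sqrt(n))
    return t, n - 1


# Made-up paired values, e.g. one property under two design settings.
t_value, dof = paired_t_statistic([1.0, 2.0, 3.0, 5.0], [0.0, 1.0, 1.0, 2.0])
print(round(t_value, 3), dof)
```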
3 Results and discussion
3.1 7,245 new synthetic proteins designed by ProteinMPNN
Using the PDB files of 61 complex structures [
29], 61 monomer structures extracted from the complex structures (see Section 2.1 for details), and 1,327 monomer structures retrieved from the SYNBIP database [
2] as the inputs, new synthetic proteins were designed by the DL-based framework ProteinMPNN [
23]. The temperature and the number of output sequences for each input SBP were set to 0.1 and 5, respectively. As a result, 7,245 synthetic binding proteins were successfully designed, covering 55 scaffolds and 342 potential targets (
Appendix A). Among them, 78.33% of the designed synthetic proteins have fewer than 200 amino acids. All the designed synthetic protein sequences were used for further analysis.
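The total of 7,245 follows directly from the input counts and the five sequences generated per input. The short check below makes the arithmetic explicit; the variable names are illustrative.

```python
SEQS_PER_INPUT = 5

monomer_inputs = 1327           # monomer structures from SYNBIP
complex_inputs = 61             # complex structures
extracted_monomer_inputs = 61   # monomers extracted from the complexes

monomer_designs = monomer_inputs * SEQS_PER_INPUT
comparison_designs = (complex_inputs + extracted_monomer_inputs) * SEQS_PER_INPUT
total_designs = monomer_designs + comparison_designs

print(monomer_designs, comparison_designs, total_designs)
```

The 6,635 monomer-based sequences are the ones analyzed in Sections 3.2 to 3.5, while the remaining 610 sequences belong to the monomer versus complex comparison.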
3.2 Excellent sequence recovery of the designed synthetic proteins
Sequence recovery is recognized as an important criterion for
de novo protein design evaluation [
40]. Here, the average sequence recovery of the 6,635 sequences generated from SBPs in the monomer state is 47.27%. In addition, there is no significant difference in sequence recovery between the 61 representative designs from SBPs in the monomer and complex states (Fig.2), with average values of 47.50% and 47.59%, respectively. This result indicates that, although ProteinMPNN decodes monomer and complex input structures differently, its sequence recovery is essentially the same for both. The ProteinMPNN-generated synthetic protein sequences are therefore expected to share a similar backbone structure with the parental SBPs (natural or artificial), as indicated by sequence recovery, which could be verified by protein 3D structure prediction [
41].
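Sequence recovery is the fraction of positions at which the designed sequence carries the same amino acid as the parental sequence. A minimal sketch is given below, assuming that both sequences have the same length (as is the case in fixed-backbone design) and using made-up sequences.

```python
def sequence_recovery(native, designed):
    """Fraction of positions where designed matches native (0 to 1)."""
    if len(native) != len(designed):
        raise ValueError("sequences must have the same length")
    if not native:
        raise ValueError("sequences must not be empty")
    matches = sum(n == d for n, d in zip(native, designed))
    return matches / len(native)


# Made-up 10-residue example: 8 of 10 positions are identical.
recovery = sequence_recovery("ACDEFGHIKL", "ACDEYGHIKV")
print(f"{recovery:.0%}")
```

Averaging this value over all designs of a group gives the kind of mean recovery reported above.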
3.3 Sequence diversity of synthetic proteins designed by ProteinMPNN
In addition to sequence recovery, it is important to analyze the diversity of new designed synthetic protein sequences [
42]. In this work, a similarity search was conducted for the 6,635 sequences generated from SBPs in the monomer state. For 516 amino acid sequences the search returned no hits, suggesting that these sequences have no detectable homologs in the PDB. In other words, these
de novo designed proteins have no direct relationship with any protein in the PDB. In addition, 551 amino acid sequences had a similarity of 30% or less. This indicates that ProteinMPNN design can yield novel amino acid sequences. Although most of the amino acid sequences designed by ProteinMPNN have over 30% sequence similarity, this also indirectly reflects the ability of ProteinMPNN to learn from the data in the PDB.
Moreover, 20 repetitive sequences (0.30%) were identified among the 6,635 ProteinMPNN-generated amino acid sequences in the monomer state (Tab.1). The repetitive sequences are mainly derived from proteins with shorter sequences. In addition, for the 12 original synthetic binding proteins with fewer than 20 amino acids, the similarity among the generated protein sequences was found to be extremely high. This suggests that designs for short synthetic binding proteins are prone to repetition. A possible reason is that too few short sequences were available during the training of ProteinMPNN: proteins with 20 or fewer amino acids account for only about 6% of all proteins in the PDB, and most of them did not meet the selection criteria of ProteinMPNN.
3.4 Improved protein solubility of the designed synthetic proteins
Solubility is a fundamental concept that is crucial in protein science [
43,
44]. As a protein trait determined by its primary amino acid sequence and environmental conditions, this concept is important for structural and biophysical research [
45]. To explore how likely the generated amino acid sequences are to have improved solubility compared to the original SBP, the proportion of increased solubility (
PSL) is defined.
PSL is the number of generated amino acid sequences with solubility better than that of original SBP (
Nb) divided by the total number of generated amino acid sequences (
Nt = 5). For each original SBP, if the
PSL is equal to or greater than 0.6, the solubility of at least 3 of the 5 designed amino acid sequences is improved compared to the original SBP. The predicted solubility data for each designed protein sequence and the original SBP can be found in
Appendix A.
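Written out as a formula, the definition above reads as follows. The typesetting of Nb and Nt as subscripted symbols is a notational choice made here for readability.

```latex
\[
\mathrm{PSL} = \frac{N_b}{N_t}, \qquad N_t = 5, \qquad
\mathrm{PSL} \in \{0,\ 0.2,\ 0.4,\ 0.6,\ 0.8,\ 1\}
\]
\[
\mathrm{PSL} \ge 0.6 \iff N_b \ge 3
\]
```

Because five sequences are generated per SBP, PSL can only take six values, and a PSL of 1 means that all five designs are predicted to be more soluble than the original SBP.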
3.4.1 Solubility analysis of the sequences designed from monomer SBPs
Fig.3(a) shows the distribution of
PSL values. For 72% of the original SBPs, most of the designed amino acid sequences show improved solubility. In particular, for 46% of the original SBPs, all designed amino acid sequences showed improved solubility (Fig.3(a)), suggesting that ProteinMPNN [
23] has significant advantages in improving the solubility of SBPs.
In addition, the distribution of SBPs with a PSL of 1 across the corresponding SBP scaffolds is plotted. As shown in Fig.3(b), when SBPs with a PSL of 1 make up more than 0.5 of the original SBPs belonging to a scaffold, solubility was improved for the sequences designed from most of the original SBPs of that scaffold. ProteinMPNN therefore appears well suited to enhancing solubility in these 21 scaffolds (Neocarzinostatin based binder, scFv, Human VH dAb, Anticalin, OBody, Megabody, Nanobody, Diabody, CI2-based binder, Repebody, VL dAb, SH2 domain, Fab, Abdulin, Affilin, Alphabody, Evibody, Glubody, I-body, Transferrin based binder, and WW domain). It should be noted that these 21 scaffolds are simply those with the most pronounced improvement in solubility; this does not imply poor solubility for the sequences generated from other scaffolds. For example, the average solubility values of all amino acid sequences belonging to the 6 scaffolds Avimer, Cytochrome b562 based binder, dArmRP, GCN4 based binder, Im9 based binder, and PHD finger domain, which have a proportion of 0 in Fig.3(b), are 0.77, 0.83, 0.80, 0.88, 0.76, and 0.71 in Fig.3(c), respectively, which is still a relatively high level of solubility.
In terms of average value and
PSL, ProteinMPNN performs well both in the absolute solubility of its designs and in improving solubility. The good solubility of the designed amino acid sequences may arise because most of the proteins used for training are soluble: the protein structures used to train ProteinMPNN were obtained from the PDB [
24], and most protein structures in the PDB resolved by NMR, X-ray, and/or cryo-EM experiments [
46] are soluble. Additionally, experimental validation has demonstrated that most proteins designed by ProteinMPNN are soluble [
23].
3.4.2 Comparison of solubility between sequences designed from monomer and complex SBPs
In Fig.4(a), the
PSL illustrates the difference in the number of amino acid sequences with enhanced solubility between sequences designed from monomer and complex SBPs. As shown, solubility improvement is slightly more frequent for monomer-based designs than for complex-based designs when
PSL > 0.6 (Fig.4(a)). In addition, the average solubility of amino acid sequences obtained from monomer-based design (62%) is also slightly higher than that of sequences obtained from complex-based design (Fig.4(b)), and the difference is statistically significant (
P-value = 0.031). One possible reason is that ProteinMPNN was trained on both single-chain and multi-chain proteins [
23]. When designing protein sequences, it considers the foldability of the sequences and their compatibility with the target structure. When designing a protein from a monomer structure, the model can focus more on optimizing the protein’s own properties, including its solubility. In complex-based design, however, the model must also account for the interactions between the protein and the target, which may sacrifice some of the protein’s solubility to ensure the overall stability of the complex.
3.5 Improved thermal stability of the designed synthetic proteins
Stability is another important property of proteins in production, and the derivation of new protein functions during evolution depends heavily on the stability of the proteins [
47]. As in the solubility analysis, to explore how likely the designed amino acid sequences are to have improved stability compared to the original SBPs, the proportion of increased stability (
PST) is defined here.
PST is the number of designed protein sequences with stability higher than that of the original SBP (
Nh) divided by the total number of designed amino acid sequences (
Nt = 5). For each original SBP, if the
PST is equal to or greater than 0.6, the stability of at least 3 of the 5 designed amino acid sequences is improved compared to the original SBP. The predicted instability index for each designed protein sequence and the original SBP can be found in
Appendix A.
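Because a lower instability index means higher stability, a design counts towards PST when its index is lower than that of the original SBP. The sketch below illustrates this; the function name and all numerical values are made up for the example.

```python
def proportion_increased_stability(original_index, designed_indices):
    """PST: share of designs with a lower instability index than the original.

    A lower instability index corresponds to a higher predicted stability.
    """
    if not designed_indices:
        raise ValueError("at least one designed sequence is required")
    improved = sum(1 for value in designed_indices if value < original_index)
    return improved / len(designed_indices)


# Hypothetical instability indices for one SBP and its five designs.
pst = proportion_increased_stability(42.0, [35.1, 39.8, 44.0, 41.2, 50.3])
print(pst)
```

In this toy case three of the five designs have a lower index than the original, so the SBP would fall into the group with at least 3 of 5 designs improved.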
3.5.1 Stability analysis of the sequences designed from monomer SBPs
Fig.5(a) shows the distribution of
PST values. For 62% of the original SBPs, most of the designed amino acid sequences show improved stability. In particular, for 38% of the original SBPs, all designed amino acid sequences showed improved stability (Fig.5(a)), suggesting that ProteinMPNN has significant advantages in improving the stability of SBPs. This is consistent with the experimental finding that the majority of proteins generated by ProteinMPNN exhibit high thermal stability [
23]. Moreover, ProteinMPNN-generated sequences are predicted to fold to native protein backbones more confidently and accurately than the original native sequences. Because natural proteins are generally structurally stable, this may explain why most sequences generated by ProteinMPNN have improved stability.
In addition, the distribution of SBPs with
PST of 1 in the corresponding SBP scaffolds is also plotted. As shown in Fig.5(b), when the proportion of those with
PST of 1 among the original SBPs belonging to the same scaffold exceeds 0.5, stability was improved for the sequences designed from most of the original SBPs of that scaffold. ProteinMPNN [
23] therefore appears well suited to enhancing stability in these 16 scaffolds (Neocarzinostatin based binder, Body, Fynomer, CI2 based binder, Avimer, Bicyclic peptide, Centyrin, scFv, Repebody, Fab, CBM based binder, Affilin, beta Roll domain, dArmRP, Evibody, beta Hairpin mimetic).
Among the 12 scaffolds with a ratio of 0, 10 scaffolds contain only 2−5 original synthetic binding proteins, so chance may explain why none of their SBPs reached a proportion of increased stability of 1. For the other scaffolds with a ratio of 0 or a low ratio, a possible reason is that certain amino acids that are important for protein stability were modified in the ProteinMPNN designs, such as the group IV WW domain, Pin1 in WW domain anti VEGFR-2 clone B1 [
48] and disulfide bonds in Nanobody anti VSG cAbAn33-1-1-1 [
49]. In addition, the variance of the stability of the amino acid sequences within most groups was found to be large, indicating that stability differs considerably among the sequences of a group. Care should therefore be taken when selecting among the designed sequences.
3.5.2 Comparison of stability between sequences designed from monomer and complex SBPs
In Fig.6(a), the PST illustrates the difference in the number of amino acid sequences with enhanced stability between sequences designed from monomer and complex SBPs. As shown, stability improvement is slightly more frequent for monomer-based designs than for complex-based designs when PST > 0.6 (Fig.6(a)). In addition, the average stability of amino acid sequences obtained from monomer-based design (57%) is also slightly greater than that of sequences obtained from complex-based design (Fig.6(b)), and the difference is statistically significant (P-value = 0.014). The reason may be similar to that proposed for the relatively higher solubility.
3.6 Eight protein scaffolds highly suitable for ProteinMPNN have been identified
The scaffolds that show improvement in both solubility and stability are considered highly suitable for ProteinMPNN. They are Neocarzinostatin-based binder, Diabody, CI2-based binder, scFv, Repebody, Fab, Affilin, and Evibody. Among these, Diabody, scFv, and Fab belong to the antibody scaffolds, while the remaining five scaffolds belong to the non-antibody scaffolds [
2].
The Affilin scaffold is based on γ-B crystallin or ubiquitin [
50]. Chymotrypsin inhibitor 2 (CI2) is a protease inhibitor naturally present in the seeds of barley [
51]. Diabodies are small antibody fragments that have two antigen binding Fv domains [
52]. Evibody is based on human cytotoxic T-lymphocyte-associated antigen 4 (CTLA-4) [
53]. Fab (fragment antigen-binding) is a smaller variant of a monoclonal antibody, consisting of one VL and one VH domain linked to their respective light and heavy constant domains [
54]. Neocarzinostatin belongs to the family of bacterial chromoproteins [
55]. Repebody is developed from Leucine-rich repeat (LRR) modules of variable lymphocyte receptors (VLRs) [
56]. A single-chain variable fragment (scFv) consists of one VL and one VH domain fused by a flexible linker [
54]. Additional details of the 8 protein scaffolds are shown in Tab.2.
3.7 New synthetic proteins designed from complex structures show better binding energy
To investigate the potential of the designed sequences to bind the protein targets of their original SBPs, protein structure prediction and protein-protein interaction analysis were further conducted. First, based on 61 original SBPs in monomer and complex forms, 140 sequences from 14 sets of high-quality complex structures and 94 sequences with the lowest ProteinMPNN score (lower is better; the score represents the model’s uncertainty about its predictions) from ProteinMPNN design [
23] were selected for protein structure prediction using AlphaFold2 [
41]. Then, each new SBP was docked to the binding site on the protein target of the original SBP, and the binding energy was estimated by PDBePISA [
39].
Detailed information on the protein-protein interaction analysis is provided in
Appendix B. Relative to the interactions between the original SBPs and their protein targets, the estimated binding energy was increased for 49.18% (30/61) of the new SBPs in the complex-based design group, but for only 27.87% (17/61) in the monomer-based design group. Moreover, for 67.21% (41/61) of the new SBPs, the estimated binding energy in the complex-based design group was higher than that in the monomer-based design group, and the difference is statistically significant (
P-value < 0.001). As shown in Fig.7, the majority of the new SBPs in the complex-based design group have potential hydrogen bond interactions (85%) and salt bridges (52%). Additionally, some complexes exhibit potential disulfide bonds (3%). For example, hydrogen bond interactions and salt bridges were observed in the interaction interface between SBP001262 and its target (Fig.8). The improved estimated binding energy for about half of the complex-based design group (49.18%) may be due to the consideration of protein-protein interactions in ProteinMPNN for protein function design [
23].
In addition, the estimated binding energy is also increased for a smaller part of the monomer-based design group (27.87%), which further supports the rationale of developing new SBPs from privileged scaffolds [
57]. Proteins designed by ProteinMPNN have been experimentally shown to exhibit superior binding capacity [
23]. Moreover, the enhanced binding energy of new proteins designed from the monomer and/or complex structures of their original SBPs shows the great potential of DL-based frameworks like ProteinMPNN in the target-directed
de novo generation of high-quality SBP libraries, which could assist experimental biologists in rational protein engineering to discover functional binders.
3.8 ProteinMPNN demonstrates tremendous potential in designing SBPs compared to traditional protein engineering methods
Among the SBPs with crystal structures corresponding to their respective targets, Anticalins N7E and N9B [
58] were selected as representatives of traditional protein engineering. They were designed from the same template protein, Lipocalin 2. For comparative analysis, ProteinMPNN was used to design the same template protein, following the same design and analysis process as with other SBPs.
Tab.3 shows that, relative to the weaker-affinity Anticalin N9B, 80% (4/5) of the proteins designed by ProteinMPNN showed improved binding affinity. Relative to Anticalin N7E, one of the five designed proteins (ProteinMPNN-5) exhibited better binding affinity. In contrast to the predicted stability, the predicted solubility of the designed proteins is largely improved. Although ProteinMPNN-5 demonstrated outstanding binding affinity, its predicted solubility and stability were not improved. Achieving good solubility and stability alongside strong binding affinity therefore remains a great challenge for ProteinMPNN. In addition, the average sequence recovery of the 5 sequences designed from Lipocalin 2 is 46.4%, and few sequences in the PDB are similar to them (Appendix B). More interestingly, a disulfide bond was found in the template protein Lipocalin 2. ProteinMPNN-5 not only folds into a structurally consistent form despite substantial sequence changes but also retains the disulfide bond (Fig.9).
Compared to traditional protein engineering methods, which may yield only a few or a dozen active proteins from a large library, ProteinMPNN shows significant potential, as indicated by the performance of the 5 designed proteins.
4 Conclusion
Using the advanced DL algorithm ProteinMPNN, 7,245 synthetic protein sequences were generated based on the original SBPs collected in SYNBIP. Analysis of the designed sequences demonstrated excellent performance in sequence recovery, solubility, and stability. The 8 scaffolds with greatly improved solubility and stability are Neocarzinostatin based binder, Diabody, CI2 based binder, scFv, Repebody, Fab, Affilin, and Evibody. In addition, the sequences generated from monomer structures are better than those generated from complex structures in terms of solubility and stability, while the opposite holds for their binding energies to the target. ProteinMPNN has some limitations in designing SBPs. For example, in the design of SBPs with short sequences (< 20), repeated sequences may appear among the 5 designed sequences. Additionally, long stretches of alanine (A) may occur, which is not consistent with expectations. This may be due to uncertainty in the model predictions or poor quality of the input backbones. To address this issue, a negative alanine bias can be added. Overall, the results suggest that ProteinMPNN can quickly generate high-quality libraries of synthetic proteins for experimental screening against specific targets, which can provide scientists with a wide range of opportunities to explore the world of synthetic proteins.