HPClas: A data-driven approach for identifying halophilic proteins based on catBoost

Shantong Hu; Xiaoyu Wang; Zhikang Wang; Menghan Jiang; Shihui Wang; Wenya Wang; Jiangning Song; Guimin Zhang

doi:10.1002/mlf2.12125

mLife, 2024, Vol. 3, Issue 4: 515–526. DOI: 10.1002/mlf2.12125

METHOD


Abstract

Halophilic proteins possess unique structural properties and show high stability under extreme conditions. These characteristics make them valuable for applications in areas such as bioenergy, pharmaceuticals, environmental clean-up, and energy production. Halophilic proteins are generally discovered and characterized through labor-intensive and time-consuming wet-lab experiments. In this study, we introduce the Halophilic Protein Classifier (HPClas), a machine learning-based classifier developed using the CatBoost ensemble learning technique to identify halophilic proteins. Extensive in silico calculations were conducted on a large public dataset of 12,574 samples, and HPClas achieved an area under the receiver operating characteristic curve (AUROC) of 0.844 on an independent test set of 200 samples. The source code and curated dataset of HPClas are publicly available at https://github.com/Showmake2/HPClas. In conclusion, HPClas can serve as a promising tool to aid the identification of halophilic proteins and accelerate their application in different fields.
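
For readers unfamiliar with the reported metric, the following is a minimal sketch of how an AUROC value can be computed from a classifier's scores on a labeled test set. It is not taken from the HPClas source code: the function name `auroc` and the toy labels and scores are hypothetical, and the computation uses the standard rank-based (Mann-Whitney) formulation rather than any particular library.

```python
from typing import Sequence


def auroc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Probability that a random positive scores higher than a random negative.

    Ties between a positive and a negative count as half a win, which is
    equivalent to the Mann-Whitney U statistic divided by n_pos * n_neg.
    """
    if len(labels) != len(scores):
        raise ValueError("labels and scores must have the same length")
    n_pos = sum(1 for y in labels if y == 1)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("AUROC needs at least one positive and one negative")

    # Sort sample indices by score, then give tied scores their average rank.
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # 1-based average rank of the tie group
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1

    # U statistic for the positive class, normalized to the [0, 1] range.
    pos_rank_sum = sum(r for r, y in zip(ranks, labels) if y == 1)
    u = pos_rank_sum - n_pos * (n_pos + 1) / 2
    return u / (n_pos * n_neg)


# Hypothetical toy data: 1 = halophilic, 0 = non-halophilic.
toy_labels = [1, 1, 0, 1, 0, 0]
toy_scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.2]
print(round(auroc(toy_labels, toy_scores), 3))
```

Under this definition, a value of 0.5 corresponds to random ranking and 1.0 to perfect separation of the two classes, so the reported 0.844 means that a randomly chosen halophilic test protein receives a higher score than a randomly chosen non-halophilic one about 84% of the time.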

Keywords

feature engineering / halophilic protein / machine learning

Cite this article

Shantong Hu, Xiaoyu Wang, Zhikang Wang, Menghan Jiang, Shihui Wang, Wenya Wang, Jiangning Song, Guimin Zhang. HPClas: A data-driven approach for identifying halophilic proteins based on catBoost. mLife, 2024, 3(4): 515–526. DOI: 10.1002/mlf2.12125

