Transfer learning enables predictions in soil-borne diseases
Lei Xin, Penghao Xie, Tao Wen, Guoqing Niu, Jun Yuan
● The Transformer model precisely predicts soil health status from high-throughput sequencing data.
● The SMOTE algorithm addresses data imbalance issues, improving model accuracy.
● Transfer learning validates the model on small samples, strengthening its generalization capabilities.
Inhibiting the occurrence of soil-borne diseases is considered the most favorable approach for promoting sustainable agricultural development, and soil disease prediction models can support precision agriculture. However, results obtained from the meta-framework often contradict one another, leading to inconsistent important features across machine learning results. It is therefore necessary to compare the classification accuracy of different machine learning models and to further optimize their features to improve that accuracy. Here, we compared eight common machine learning algorithms (XGBoost, CatBoost, Decision Tree, LGBM, Naïve Bayes, Perceptron, Logistic Regression, and Random Forest) at the family, genus, and class levels. Important features were extracted from the models based on differences in model accuracy and in the features each model ranked as important, and were then interpreted using feature importance. Subsequently, the data were resampled using the SMOTE algorithm. The results show that the SMOTE-Transformer model performs well, surpassing the training results of the voting and stacking strategies and reaching an accuracy of 90%. We also deployed the SMOTE-Transformer model on sequencing data, where it achieved an accuracy of over 80%. The SMOTE-Transformer model offers a new approach to soil microbial data analysis by greatly improving the accuracy and robustness of soil microbial data processing tools.
soil disease / feature importance / heterogeneous integration strategy / transfer learning
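The SMOTE resampling step mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `smote`, the toy abundance values, the choice of the diseased class as the minority, and the parameters `k` and `seed` are all assumptions made for illustration. The sketch only shows the core idea of SMOTE, which is to create each synthetic minority sample by interpolating between a minority sample and one of its nearest minority-class neighbours.

```python
import math
import random


def smote(minority, n_new, k=5, seed=0):
    """Return n_new synthetic samples interpolated between minority-class neighbours."""
    if len(minority) < 2:
        raise ValueError("SMOTE needs at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)  # cannot have more neighbours than other samples
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Rank the other minority samples by Euclidean distance to the base sample.
        others = [j for j in range(len(minority)) if j != i]
        others.sort(key=lambda j: math.dist(base, minority[j]))
        neighbour = minority[rng.choice(others[:k])]
        # The new point lies on the segment between base and neighbour.
        gap = rng.random()
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


# Hypothetical toy data: rows are samples, columns are relative abundances of two taxa.
healthy = [[0.10, 0.90], [0.20, 0.80], [0.15, 0.85],
           [0.30, 0.70], [0.25, 0.75], [0.12, 0.88]]
diseased = [[0.80, 0.20], [0.70, 0.30], [0.75, 0.25]]  # assumed minority class

# Oversample the minority class until both classes have the same size.
new_diseased = smote(diseased, n_new=len(healthy) - len(diseased), k=2)
balanced_diseased = diseased + new_diseased
```

Because each synthetic point is a convex combination of two real minority samples, it stays inside the range spanned by the minority class. In practice, resampling of this kind is usually applied to the training split only, so that synthetic points do not leak into the evaluation data.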