Machine learning prediction models are increasingly used in solid waste management systems because of their high accuracy and their ability to take in new, complex data and mine them in depth (
Noori et al., 2010;
Shahabi et al., 2012;
Abbasi and El Hanandeh, 2016; Kontokosta et al., 2018). In addition, machine learning models can be broadly applied to short-, medium- and long-term predictions of MSW generation (
Zade and Noori, 2008;
Ali Abdoli et al., 2012;
Abbasi et al., 2013). Machine learning algorithms such as artificial neural network (ANN) (
Noori et al., 2010;
Azadi and Karimi-Jashni, 2016), support vector machine (SVM) (
Abbasi and El Hanandeh, 2016) and gradient boost regression tree (GBRT) (
Johnson et al., 2017;
Kontokosta et al., 2018) have been adopted for MSW generation forecasting. Relative to other algorithms, GBRT offers three advantages. First, it flexibly handles various types of data, including both continuous and discrete values. Second, it can reach relatively high prediction accuracy with relatively little tuning time. Third, when a robust loss function is used, it is less sensitive to outliers. The accuracy and practicability of model predictions often depend on the selection and identification of feature variables (
Ordóñez-Ponce et al., 2006;
Adeogba et al., 2019). Although a model simulation in Vietnam obtained an R² value > 0.96, that study used only 63 detailed data sets for its machine learning and geographic distribution analysis (
Nguyen et al., 2021). Leave-one-out or K-fold cross-validation can improve the reliability of model accuracy estimates, especially for small data sets. Cross-validation is a model selection method that holds out part of the data set to test model validity. However, only 12% of the studies covered in a recent review of ANN studies applied this method, indicating that its importance needs further attention (
Xu et al., 2021). Extensive and comprehensive feature variables can further improve model accuracy (
Sun and Chungpaibulpatana, 2017). However, few studies have built an MSW generation model from multi-level feature variables (e.g., socioeconomic factors, natural conditions and internal conditions). Fewer than 10% of the published works on machine learning used more than 1000 data points in a single report (
Xu et al., 2021). In addition, most models in existing research relied on small-scale data collection at the city level (
Noori et al., 2010;
Abbasi et al., 2014;
Abbasi and El Hanandeh, 2016;
Azadi and Karimi-Jashni, 2016;
Johnson et al., 2017;
Kannangara et al., 2018;
Kontokosta et al., 2018;
Wu et al., 2020), which limits the broad applicability of these models to some extent. Therefore, it is critical to develop a high-accuracy model, based on large-scale data collection and a wide range of influencing variables, that can be broadly applied to the prediction of MSW production.
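
The K-fold and leave-one-out cross-validation discussed above can be illustrated with a minimal Python sketch. It is not the procedure of any cited study: the data are synthetic, the one-variable least-squares fit is only a stand-in for a real regressor such as GBRT, and all function names (`fit_line`, `k_fold_indices`, `cross_val_rmse`) are hypothetical. The point is only to show how every sample is held out exactly once, and that leave-one-out is the special case where the number of folds equals the number of samples.

```python
import random
from statistics import mean


def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x (stand-in for any regressor)."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx if sxx else 0.0
    a = my - b * mx
    return a, b


def k_fold_indices(n, k, seed=0):
    """Shuffle the indices 0..n-1 and deal them into k disjoint folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def cross_val_rmse(xs, ys, k):
    """Out-of-sample RMSE pooled over k folds; k == len(xs) is leave-one-out."""
    sq_errors = []
    for test_idx in k_fold_indices(len(xs), k):
        held_out = set(test_idx)
        train = [i for i in range(len(xs)) if i not in held_out]
        # Fit only on the training part, then score on the held-out fold.
        a, b = fit_line([xs[i] for i in train], [ys[i] for i in train])
        sq_errors.extend((ys[i] - (a + b * xs[i])) ** 2 for i in test_idx)
    return (sum(sq_errors) / len(sq_errors)) ** 0.5


# Synthetic example (assumed values, not real MSW data): one feature, noisy linear target.
rng = random.Random(42)
xs = [float(i) for i in range(1, 31)]
ys = [2.0 + 0.5 * x + rng.gauss(0, 1) for x in xs]

rmse_5fold = cross_val_rmse(xs, ys, k=5)
rmse_loo = cross_val_rmse(xs, ys, k=len(xs))
print(f"5-fold RMSE: {rmse_5fold:.3f}")
print(f"Leave-one-out RMSE: {rmse_loo:.3f}")
```

Because each prediction is made by a model that never saw the corresponding sample, the pooled error reflects out-of-sample performance, which is particularly informative when only a few dozen observations are available.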