Prediction and Machine Learning Analysis of Urban Waterlogging Risks in High-Density Areas From the Perspective of the Built Environment: A Case Study of Shenzhen, China

Shiqi ZHOU; Weiyi JIA; Zhiyu LIU; Mo WANG

doi:10.15302/J-LAF-0-020023

Landsc. Archit. Front., 2024, Vol. 12, Issue (5): 48–60. DOI: 10.15302/J-LAF-0-020023

PAPERS


Abstract

With the continuous advancement of big data and artificial intelligence technologies, data-driven machine learning algorithms have been widely applied in studies of urban resilience, particularly in addressing the challenging issue of urban waterlogging. It is now a pressing task to understand the influencing factors of waterlogging from the perspective of the built environment and to provide guidance for dynamic monitoring and early warning services. Focusing on Shenzhen, China, a typical high-density urbanized city, this research constructed a multifactorial dataset encompassing hydrological, meteorological, urban morphology, and waterlogging event data. It then assessed and compared the performance of four mainstream machine learning models (LightGBM, RF, SVR, and BPDNN) in predicting urban waterlogging risks. The results showed that LightGBM had the best accuracy and robustness in predicting waterlogging depths and risk levels in urban areas. The research also employed an interpretability algorithm, Shapley Additive Explanations (SHAP), for decoupling analysis. The results indicated that hydro-meteorological factors (the total rainfall volume and the rainfall lasting time) and several architectural configuration factors (e.g., density of buildings, building congestion degree) are the main influencing factors. In addition, the percentage of water body is vital to waterlogging regulation and retention, exhibiting a significant mitigating effect when it exceeds 2.5%. This research provides a new technical method for urban waterlogging prediction and reveals the influencing factors and intrinsic mechanisms from the perspective of the built environment, which is of great significance for enhancing the resilience of high-density cities.


Keywords

Urban Waterlogging / Machine Learning / Model Performance Evaluation / Comparative Research / Model Interpretability Analysis / High-Density City

Highlight

	● Proposes a comprehensive research framework combining the LightGBM model and the SHAP interpretability algorithm, and predicts waterlogging depth and its risk level in urban areas
	● Verifies with machine learning methods that historical downtowns in high-density cities face higher risks of waterlogging during extreme rainfall events
	● Presents a novel exploration in the high-density urban context that analyzes the influencing factors and intrinsic mechanisms of urban waterlogging, focusing on hydro-meteorological, urban surface, and architectural configuration factors

Cite this article


Shiqi ZHOU, Weiyi JIA, Zhiyu LIU, Mo WANG. Prediction and Machine Learning Analysis of Urban Waterlogging Risks in High-Density Areas From the Perspective of the Built Environment: A Case Study of Shenzhen, China. Landsc. Archit. Front., 2024, 12(5): 48-60 DOI:10.15302/J-LAF-0-020023


1 Introduction

One of the core tasks in building urban resilience is accurately and effectively predicting urban risks and their impacts during spatial planning, alongside devising targeted adaptive planning strategies^[¹^]. With the continuous advancement of artificial intelligence technologies, data-driven machine learning techniques have been widely applied in predicting urban waterlogging risks^[²^]^~^[⁵^]. For instance, Elham Rafiei-Sardooi et al. utilized a Support Vector Machine model to map the flood vulnerability of the Khiyav Chai basin in Iran^[³^]; Zhaoli Wang et al. assessed the flood risk of the Dongjiang basin using a Random Forest model^[⁴^]. Compared with traditional hydrological and hydraulic models, the advantage of machine learning models lies in their ability to handle complex high-dimensional data with limited computational resources, especially in analyzing the nonlinear relationships between multivariate factors and target variables^[⁵^]. However, traditional machine learning models still face uncertainties in practice due to overfitting (e.g., reaching only a local optimum when large datasets are used) and computational challenges (e.g., difficulty in generating optimal solutions with complex-structured data).

In recent years, a new generation of ensemble machine learning models, surpassing traditional algorithms in robustness, has emerged and been widely adopted in fields such as urban hydrological management^[²^] ^[⁶^]. The study by Hossein Shafizadeh-Moghadam et al. showed that ensemble machine learning models were more accurate and stable than traditional ones in flood susceptibility prediction^[⁷^]; Yuchen Guo et al. also confirmed that ensemble machine learning models significantly outperform the traditional Backpropagation Deep Neural Network model in flood prediction^[⁸^]; Zening Wu et al. applied the Gradient Boosting Decision Tree (GBDT) ensemble model to predict urban flood and waterlogging depths, validating its high accuracy^[⁹^]. Current research often focuses on the practicality of specific machine learning algorithms in urban flood prediction. However, ensemble machine learning models have been less explored in multi-scenario urban waterlogging prediction, resulting in a lack of detailed model comparisons and applications in spatial practice. This study aims to conduct a detailed comparative analysis of the ensemble algorithm LightGBM (Light Gradient Boosting Machine)^[¹⁰^] with three traditional machine learning algorithms, namely Random Forests (RF)^[¹¹^], Support Vector Regression (SVR)^[¹²^], and Backpropagation Deep Neural Networks (BPDNN)^[⁸^], to reveal their performance differences in predicting urban waterlogging risks in high-density areas and to precisely dissect the influencing factors.

From the perspective of the built environment, this paper proposes a series of suggestions for enhancing urban resilience and provides an innovative method for predicting urban waterlogging risks in high-density areas, offering valuable insights and guidance for future urban planning theoretically and practically.

2 Study Area and Research Methods

This study focused on Shenzhen, a quintessential high-density urbanized city, as the case study, taking the city's precipitation events from January 1, 2019 to December 31, 2021 as samples. The research began by constructing a multifactorial dataset for model training and testing, encompassing hydrological, meteorological, urban morphology, and waterlogging event data. The explanatory variables consisted of 21 independent variables in 3 categories (i.e., hydro-meteorological factors, urban surface factors, and architectural configuration factors), with waterlogging depth as the target variable. The study assessed the performance of four machine learning models (LightGBM, RF, SVR, and BPDNN) in predicting urban waterlogging risks. Based on the evaluation of model accuracy and robustness, this study selected the optimal model and generated the susceptibility distribution map of urban waterlogging risks in Shenzhen. Building on these predictions, the study then applied Shapley Additive Explanations (SHAP) to conduct an in-depth analysis in terms of global feature importance, feature dependency, and local feature interpretability, offering practical references for decision-making on urban resilience enhancement (Fig.1).

2.1 Study Area

Nestled in the southern part of Guangdong Province along the eastern bank of the Pearl River Estuary, Shenzhen experiences a typical subtropical monsoon climate with prolonged summers, short winters, and copious precipitation. By the end of 2022, the city comprised 9 administrative districts and 1 functional district (Dapeng New District), spanning an area of 1,997.47 km²^[¹³^] and home to a permanent population of approximately 17.66 million^[¹⁴^]. Studies have indicated that Shenzhen's urban waterlogging primarily results from sudden and intense rainstorms in summer^[¹⁵^]. Statistics show that the city's annual rainfall averages 1,932.9 mm, of which approximately 86% occurs between April and September^[¹⁶^]. In recent years, along with the expansion of built-up areas, substantial alterations in urban spatial structure and surface conditions have progressively encroached upon urban green spaces and water retention areas, intensifying urban waterlogging risks.

2.2 Data Sources and Pre-processing

2.2.1 Data of Urban Waterlogging Depths

The data of urban waterlogging depths between 2019 and 2021 used in this study were sourced from 171 monitoring stations in Shenzhen, collected by curb membrane pressure sensors with a sampling interval of 1 hour (Fig.2). Taking into account the city's geographical and hydrological characteristics, the study area was divided into 171 sub-catchment units. The original sample data recorded all the rainfall events from January 1, 2019, to December 31, 2021, totaling 26,305 samples, which were further classified according to a 12-hour lag time (LTIME); independent rainfall events were then extracted from the continuous time series^[¹⁷^], and rainfall events with a total rainfall of less than 1 mm were excluded. The remaining 167 rainfall events, which had a significant impact on urban waterlogging, formed the final sample dataset.
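The event extraction step can be sketched as follows. This is a minimal illustration, not the authors' procedure: it assumes that the 12-hour lag is read as a minimum dry gap separating two independent events, and the function name `extract_events` and the toy series are hypothetical.

```python
def extract_events(hourly_rain_mm, min_dry_hours=12, min_total_mm=1.0):
    """Split an hourly rainfall series into independent events.

    Assumption: an event ends once `min_dry_hours` consecutive dry hours
    have passed. Events with a total below `min_total_mm` are dropped.
    """
    events, current, dry = [], [], 0
    for hour, rain in enumerate(hourly_rain_mm):
        if rain > 0:
            current.append((hour, rain))
            dry = 0
        elif current:
            dry += 1
            if dry >= min_dry_hours:
                events.append(current)
                current, dry = [], 0
    if current:
        events.append(current)
    return [e for e in events if sum(r for _, r in e) >= min_total_mm]


# Toy hourly series (mm), not data from the study.
series = [0.5] + [0] * 12 + [3, 0, 4] + [0] * 12 + [0.2, 0.3]
events = extract_events(series)
print(len(events), events)
```

In the toy series, the first and last wet spells total less than 1 mm and are discarded, so only the middle event remains.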

2.2.2 Data of Impact Factors

Existing literature has revealed that hydro-meteorological conditions^[¹⁸^]^~^[²⁰^], urban surface^[⁴^]^[²¹^]^~^[²⁴^], and architectural configurations^[⁴^]^[²⁵^]^~^[²⁷^] primarily affect the occurrence and severity of urban waterlogging. Accordingly, this study selected 21 independent variables, comprising 2 hydro-meteorological factors^[¹⁵^]^[²⁸^], 10 urban surface factors^[²⁴^]^[²⁸^]^~^[³¹^] (Tab.1), and 9 architectural configuration factors^[²⁵^]^[²⁹^]^[³²^]^~^[³⁴^] (Tab.2) as input features for the models. These factors were statistically analyzed for each sub-catchment unit.

2.2.3 Rainfall Scenario Simulation

As global climate change intensifies, cities are likely to experience extreme weather events more frequently in the future, exacerbating urban waterlogging risks. Assessing the waterlogging risk of a given area therefore requires simulating and predicting multiple rainfall scenarios. This study set a one-hour LTIME and recurrence intervals of 1, 2, 3, 5, 10, and 20 years, thereby categorizing the 167 rainfall events into 6 scenarios. The storm intensity was calculated with Shenzhen's storm intensity formula^[³⁵^]:

(1)

$$q = \frac{8.701\,(1 + 0.594 \lg R)}{(t + 11.13)^{0.555}},$$

where q represents the rainfall intensity, t denotes the LTIME, and R stands for the recurrence interval. The Chicago hyetograph method, which most closely mirrors actual observational conditions^[¹⁸^] ^[³⁶^], was then adopted to simulate the precipitation of different recurrence intervals.
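As a rough illustration of Eq. (1) and the Chicago hyetograph, the sketch below evaluates the intensity for the six recurrence intervals and distributes a one-hour storm around a peak. Several points are assumptions not stated in the text: t is taken in minutes and q in mm/min, the peak position ratio `peak_ratio = 0.4` is a hypothetical value, and the function names are illustrative only.

```python
import math

# Coefficients of Eq. (1) as given in the text.
A, C, B, N = 8.701, 0.594, 11.13, 0.555


def storm_intensity(recurrence_years, duration_min):
    """Average intensity q from Eq. (1); assumed units: mm/min, t in minutes."""
    return A * (1 + C * math.log10(recurrence_years)) / (duration_min + B) ** N


def chicago_hyetograph(recurrence_years, duration_min=60, peak_ratio=0.4, step_min=1.0):
    """Instantaneous intensity at the midpoint of each time step.

    peak_ratio (relative position of the peak, between 0 and 1) is a
    hypothetical value, not one taken from the study.
    """
    a = A * (1 + C * math.log10(recurrence_years))
    peak_time = peak_ratio * duration_min
    steps = int(round(duration_min / step_min))
    series = []
    for k in range(steps):
        t = (k + 0.5) * step_min
        if t < peak_time:
            u = (peak_time - t) / peak_ratio        # rescaled time before the peak
        else:
            u = (t - peak_time) / (1 - peak_ratio)  # rescaled time after the peak
        series.append(a * ((1 - N) * u + B) / (u + B) ** (1 + N))
    return series


for years in (1, 2, 3, 5, 10, 20):
    q = storm_intensity(years, 60)
    depth = sum(chicago_hyetograph(years))  # 1-minute steps, so the sum approximates depth
    print(f"R = {years:>2} a: q = {q:.3f}, one-hour depth from hyetograph = {depth:.1f}")
```

By construction, the hyetograph integrates to the same storm depth as the average intensity times the duration, which gives a quick consistency check on the implementation.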

2.3 Research Models and Methods

This study conducted hyperparameter optimization on 4 typical machine learning algorithm models. After training and testing, it utilized common model assessment metrics (R², MAPE, and RMSE) to identify the optimal model for predicting urban waterlogging depths in high-density areas. Additionally, the study used SHAP to analyze global feature importance, feature dependency, and local feature interpretability, so as to identify key factors influencing urban flooding and offer targeted recommendations for mitigation.

2.3.1 Models of Machine Learning Algorithms

(1) LightGBM

The ensemble algorithm LightGBM, developed by Microsoft, is a distributed gradient boosting algorithm built upon the Gradient Boosting Decision Tree (GBDT) framework^[³⁷^], which stands as one of the most efficient machine learning algorithms currently available^[¹⁰^]. It introduces three major enhancements to the traditional GBDT algorithm structure: Gradient-based One-Side Sampling (GOSS), Exclusive Feature Bundling (EFB), and a histogram-based decision-making algorithm^[³⁸^]. On the basis of the given training dataset {(x_i, y_i)}_{i=1}^{N} (x_i represents the independent variables and y_i denotes the target variables), LightGBM aims to minimize the expected value of a specific loss function. The objective function is defined as follows:

(2)

$$\mathrm{Obj} = \sum_{i=1}^{n} l(y_i, \hat{y}_i) + \sum_{k=1}^{K} \Omega(f_k),$$

(3)

$$\Omega(f_k) = \gamma T + \frac{1}{2} \lambda \lVert \omega \rVert^2,$$

where Obj represents the objective to be minimized, and l(y_i, ŷ_i) is the loss function between the actual target value y_i and the predicted value ŷ_i. To prevent overfitting, a regularization term is defined as follows: the regularization component Ω(f_k) includes the complexity cost introduced by adding new leaf nodes, where f_k represents the tree model k, γ is the complexity cost of introducing new leaves, T is the number of leaf nodes in the tree, λ is the leaf weight adjustment coefficient, and ω denotes the weight value of the leaves.
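To make Eqs. (2) and (3) concrete, the following sketch evaluates the regularized objective for a toy ensemble. It assumes squared error as the loss l, which the text does not specify; the function names, the values of γ and λ, and all numbers are hypothetical.

```python
def tree_penalty(leaf_weights, gamma, lam):
    """Eq. (3): gamma * T + 0.5 * lambda * ||w||^2 for one tree."""
    return gamma * len(leaf_weights) + 0.5 * lam * sum(w * w for w in leaf_weights)


def objective(y_true, y_pred, trees, gamma=0.1, lam=1.0):
    """Eq. (2), with squared error assumed as the loss l."""
    loss = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))
    return loss + sum(tree_penalty(w, gamma, lam) for w in trees)


# Toy values, not data from the study.
y_true = [2.0, 5.0, 9.0]
y_pred = [2.5, 4.0, 9.5]
trees = [[0.5, -1.0], [0.5, 0.0, 1.0]]  # leaf weights of two small trees
print(objective(y_true, y_pred, trees))
```

Raising γ penalizes trees with more leaves, while raising λ shrinks large leaf weights; both push the ensemble toward simpler trees.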

(2) RF

RF, an ensemble learning model, was first introduced by Leo Breiman in 2001^[¹¹^]. It is an enhanced version of the bagging decision tree, with the final output obtained by majority voting among the individual decision trees for classification, or by averaging their predictions for regression^[³⁹^], making it flexible and user-friendly.

(3) SVR

As a branch of SVM, SVR is primarily employed to address regression problems. Its objective is to minimize the distance between the hyperplane and the farthest sample points so that the data can be fitted by using the hyperplane. This process necessitates the selection of an appropriate kernel function. Studies have shown that the Radial Basis Function (RBF) outperforms other kernels in urban waterlogging prediction^[¹²^] ^[⁴⁰^].

(4) BPDNN

BPDNN is a multilayer feedforward network trained via a backpropagation algorithm based on the error between the output and the desired output. BPDNN can handle nonlinear problems due to its multilayer structure, making it extensively used in the risk assessment of flood and waterlogging hazards^[⁸^].

2.3.2 Assessment Methods of Model Performance

This study employed the metrics of R², MAPE, and RMSE to assess and compare the performance of the models. R² quantifies the fit of the regression model^[⁴¹^]; MAPE represents the relative error between the predictions and actual values^[⁴²^]; RMSE evaluates the average error between predictions and actual values^[⁴³^]. The mathematical formulas for these metrics are as follows, respectively:

(4)

$$R^2(y, \hat{y}) = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2},$$

(5)

$$\mathrm{MAPE}(y, \hat{y}) = \frac{1}{n} \sum_{i=1}^{n} \left| \frac{y_i - \hat{y}_i}{y_i} \right|,$$

(6)

$$\mathrm{RMSE}(y, \hat{y}) = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2},$$

where y_i refers to the observation i of waterlogging depth, ŷ_i represents the corresponding predicted urban waterlogging depth (cm), and n denotes the number of observations.
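The three metrics in Eqs. (4) to (6) translate directly into code. The sketch below is a plain-Python illustration with made-up depths; the function names are illustrative and the numbers are not results from the study.

```python
import math


def r2(y, y_hat):
    """Eq. (4): coefficient of determination."""
    mean_y = sum(y) / len(y)
    ss_res = sum((a - b) ** 2 for a, b in zip(y, y_hat))
    ss_tot = sum((a - mean_y) ** 2 for a in y)
    return 1 - ss_res / ss_tot


def mape(y, y_hat):
    """Eq. (5): mean absolute percentage error, returned as a fraction.

    Undefined when an observed depth is zero, so such samples would need
    to be handled beforehand.
    """
    return sum(abs((a - b) / a) for a, b in zip(y, y_hat)) / len(y)


def rmse(y, y_hat):
    """Eq. (6): root mean square error, in the units of y (cm here)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(y, y_hat)) / len(y))


# Toy observed and predicted depths in cm.
observed = [4.0, 8.0, 12.0, 16.0]
predicted = [5.0, 7.0, 12.0, 18.0]
print(r2(observed, predicted), mape(observed, predicted), rmse(observed, predicted))
```

Note that MAPE weights errors on shallow depths more heavily than RMSE does, which is one reason for reporting the metrics together.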

2.3.3 SHAP Algorithm

The challenge of interpretability, commonly known as the "black box" issue, remains one of the drawbacks of ensemble machine learning^[⁴⁴^]. The SHAP algorithm, based on cooperative game theory, constructs an additive explanation model that estimates the contribution of each feature. Numerous recent studies have started employing SHAP to visually explain complex ensemble models, validating its strong credibility^[⁴⁵^] ^[⁴⁶^]. This study adopted the SHAP algorithm to investigate the global relationships and importance of various features with respect to urban waterlogging depths, while also conducting local feature interpretability analysis, as follows:

(7)

$$g(z') = \phi_0 + \sum_{i=1}^{M} \phi_i z'_i,$$

where g represents the explanation model, z′ ∈ {0, 1}^M is the coalition vector, M is the maximum coalition size, and φ_i denotes the contribution of feature i to the model output.

In the SHAP algorithm, the contribution of each feature is determined by its marginal contributions to the output^[⁴⁷^]. This facilitates the interpretation of the machine learning model from both global and local perspectives. Shapley values not only reflect the importance of each feature but also indicate the positive or negative impact on the target variable^[⁴⁸^].
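The additive form of Eq. (7) can be checked on a small example by computing exact Shapley values through enumeration of all coalitions. This is only a sketch of the underlying idea: it assumes that features absent from a coalition are replaced by a single baseline value, whereas practical SHAP implementations average over background data and use faster tree-specific algorithms. The toy model, its coefficients, and the inputs are hypothetical.

```python
from itertools import combinations
from math import factorial


def shapley_values(model, x, baseline):
    """Exact Shapley values; features outside a coalition take the baseline value."""
    m = len(x)

    def value(coalition):
        z = [x[i] if i in coalition else baseline[i] for i in range(m)]
        return model(z)

    phi = []
    for i in range(m):
        others = [j for j in range(m) if j != i]
        total = 0.0
        for size in range(m):
            weight = factorial(size) * factorial(m - size - 1) / factorial(m)
            for subset in combinations(others, size):
                s = set(subset)
                total += weight * (value(s | {i}) - value(s))
        phi.append(total)
    return value(set()), phi  # phi_0 and the per-feature contributions


def toy_model(z):
    """Hypothetical depth model with a rainfall-duration interaction."""
    rain, duration, water = z
    return 0.05 * rain + 0.02 * duration - 0.5 * water + 0.001 * rain * duration


x = [120.0, 30.0, 2.0]
phi0, phi = shapley_values(toy_model, x, baseline=[40.0, 10.0, 4.0])
print(phi0, phi, phi0 + sum(phi), toy_model(x))
```

The last two printed values agree, which is the additivity property expressed by Eq. (7) when every feature is present in the coalition.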

3 Results and Discussion

3.1 Performance Comparison of Algorithm Models

The evaluation of model performance on the training dataset (Tab.3) indicated that LightGBM (R² = 0.96) had a significantly better fit than RF (R² = 0.89), SVR (R² = 0.78), and BPDNN (R² = 0.69). Similarly, LightGBM exhibited the highest fitting accuracy with the smallest relative error (MAPE = 0.22), followed by RF (MAPE = 0.32), SVR (MAPE = 0.44), and BPDNN (MAPE = 0.63). As for RMSE, LightGBM showed the lowest value (RMSE = 1.81 cm), indicating minimal deviation and the best predictive accuracy, followed by RF (RMSE = 2.26 cm), SVR (RMSE = 3.12 cm), and BPDNN (RMSE = 3.64 cm). Overall, LightGBM demonstrated sound precision and stability.

To verify the robustness of the models in predicting urban waterlogging depth, 25% of the rainfall events (42 events, totaling 1,790 samples) were held out as an independent test dataset, following standard machine learning practice. The results confirmed that LightGBM consistently outperformed the other models in predicting urban waterlogging depth on the test dataset (R² = 0.90, MAPE = 0.28, and RMSE = 2.29 cm).
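Holding out whole rainfall events, rather than individual hourly samples, keeps samples from one event from appearing in both sets. The sketch below illustrates such a split; whether the authors selected events at random is not stated, so the shuffling, the seed, and the function name `split_events` are assumptions.

```python
import random


def split_events(event_ids, test_fraction=0.25, seed=0):
    """Hold out whole rainfall events so no event contributes to both sets."""
    ids = sorted(set(event_ids))
    random.Random(seed).shuffle(ids)
    n_test = round(test_fraction * len(ids))
    return ids[n_test:], ids[:n_test]


# 167 events, identified here by placeholder integers.
train_events, test_events = split_events(range(167))
print(len(train_events), len(test_events))
```

With 167 events and a 25% fraction, this holds out 42 events, the same count as reported above.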

The results showed that the best-fit lines of LightGBM and RF are closer to the ideal-fit line of "y = x" (Fig.3). Further comparative analysis revealed that although RF performed comparably with LightGBM when urban waterlogging depths were less than 10 cm, LightGBM excelled when the depths grew. Considering the comprehensive metric evaluation and the distribution of absolute error points, it can be concluded that LightGBM significantly surpassed the other three machine learning models in terms of prediction accuracy and robustness. Thus, this study employed LightGBM to predict urban waterlogging depths under different rainfall scenarios in Shenzhen and create corresponding urban waterlogging risk maps.

3.2 Predictions of Urban Waterlogging Depths

Utilizing the natural break method, this study classified the peak waterlogging depth risks in the sub-catchment units into 5 categories: very low (0 ~ 5 cm), low (5 ~ 10 cm), moderate (10 ~ 15 cm), high (15 ~ 20 cm), and very high (≥ 20 cm). The study set the one-hour duration and compared the waterlogging risk variations across different recurrence intervals—1, 2, 3, 5, 10, and 20 years (Fig.4).
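The five risk classes can be expressed as a simple lookup on peak depth. In this sketch, the break values come from the classification above, while the treatment of boundary values (a depth equal to a break falls into the higher class, consistent with "≥ 20 cm") is an assumption; the names are illustrative.

```python
from bisect import bisect_right

BREAKS_CM = [5, 10, 15, 20]
LEVELS = ["very low", "low", "moderate", "high", "very high"]


def risk_level(depth_cm):
    """Map a peak waterlogging depth (cm) to one of the five risk classes."""
    if depth_cm < 0:
        raise ValueError("depth must be non-negative")
    return LEVELS[bisect_right(BREAKS_CM, depth_cm)]


for depth in (2.0, 7.5, 12.3, 19.9, 26.0):
    print(depth, risk_level(depth))
```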

As the rainfall recurrence interval increased from 1 year to 10 years, the proportion of very low risk zones in Shenzhen gradually decreased, while the proportions of low and moderate risk zones correspondingly increased (Fig.4). During this phase, no high or very high risk zones were observed, indicating that although local waterlogging may occur in areas with outdated drainage facilities, the majority of Shenzhen's drainage systems can effectively respond to moderate rainfall events. However, when the recurrence interval extended to 20 years, the spatial distribution of urban waterlogging risk changed significantly, particularly with a notable increase of moderate risk zones. This revealed that most areas in Shenzhen would experience varying degrees of urban waterlogging during 20-year rainfall events, with high risk zones mainly found in older districts with complex drainage and terrain conditions, such as Nanshan, Futian, and Pingshan Districts.

3.3 Model Interpretability Analyses

3.3.1 Global Feature Importance

Global feature importance quantifies the impact of each feature on the prediction outcome within a given model, primarily assessed through the mean absolute Shapley value of each variable^[⁴⁹^]. Calculated over all features (Fig.5), the analysis revealed that six features together accounted for about 60% of the total importance: the hydro-meteorological factors TOTAL_R and LTIME had by far the largest impact on the model, contributing 24.6% and 15.8% of the importance, respectively; PW and PIS contributed 6.4% and 5.1%; and BCD and DB contributed 4.4% and 4.1%, respectively.
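The importance shares can be derived from a matrix of per-sample Shapley values by averaging absolute values per feature and normalizing. The sketch below uses a toy matrix that is not from the study, and the function name is illustrative. It also sums the six percentages reported above as a simple arithmetic check.

```python
def global_importance(shap_rows, feature_names):
    """Mean |SHAP| per feature, expressed as a percentage share of the total."""
    n = len(shap_rows)
    mean_abs = [sum(abs(row[j]) for row in shap_rows) / n
                for j in range(len(feature_names))]
    total = sum(mean_abs)
    shares = {name: 100 * v / total for name, v in zip(feature_names, mean_abs)}
    return dict(sorted(shares.items(), key=lambda kv: kv[1], reverse=True))


# Toy Shapley values for two samples and three features (hypothetical).
toy_rows = [[2.0, -1.0, 0.5], [-4.0, 1.0, -0.5]]
shares = global_importance(toy_rows, ["TOTAL_R", "LTIME", "PW"])
print(shares)

# Shares reported in the text for the six leading features.
reported = {"TOTAL_R": 24.6, "LTIME": 15.8, "PW": 6.4, "PIS": 5.1, "BCD": 4.4, "DB": 4.1}
print(round(sum(reported.values()), 1))
```

Because absolute values are averaged, a mitigating feature such as PW still receives a positive importance share; its direction of effect only shows up in the signed Shapley values.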

Overall, hydro-meteorological factors significantly outweighed the other two kinds of factors in their impact on urban waterlogging depth, confirming previous studies that identified rainfall amount and duration as the main inducing factors of urban waterlogging^[¹⁸^]^~^[²⁰^]. Meanwhile, PW was the only feature that had a mitigating effect on urban waterlogging. This partly explains why many cities construct artificial lakes within urban areas: to capture runoff during extreme rainfall events, alleviating pressure on downstream water bodies and stormwater infrastructure networks.

Among the architectural configuration factors, both BCD and DB describe building density. As these two features increase, there is a corresponding rise in PIS, which leads to a reduction in the stormwater retention capacity of blue-green infrastructure, thereby increasing the likelihood of urban waterlogging.

3.3.2 Feature Dependency

To delve into the nonlinear relationships between urban waterlogging depth and the primary disaster-inducing factors, this study employed feature dependency plots of SHAP to reveal the contribution of individual variables to a given model's output. The feature dependency plot illustrates the extent to which a particular feature modifies the model's prediction, visualizing the marginal effects between feature values and their corresponding Shapley values. The sign of Shapley values (+/–) indicates whether a specific feature has a positive or negative impact on the prediction^[⁵⁰^]. According to the results of the global importance analysis, the six most highly ranked features from the three categories of factors were selected for further analysis.

(1) Hydro-meteorological factors

The results demonstrated a generally positive correlation between TOTAL_R and urban waterlogging depth (Fig.6). When TOTAL_R was below 25 mm, it had only a slight impact on urban waterlogging. However, once TOTAL_R exceeded 25 mm, its impact unfolded in three stages: initially, urban waterlogging risk escalated as the total rainfall increased from 25 mm to 100 mm, with the average Shapley value rising from 2 to approximately 6; the impact of rainfall then stabilized when TOTAL_R was between 100 mm and 125 mm, indicating a plateau in which additional rainfall did not further exacerbate urban waterlogging; finally, when the rainfall surpassed 125 mm, the average Shapley value climbed swiftly again and urban waterlogging risk kept increasing, possibly due to the limited capacity of the urban drainage system. Although local variations in urban environment and infrastructure might affect sensitivity to rainfall, in general, the strong influence of rainfall on urban waterlogging risk is consistent across nearly all regions.

The dependency plot of Shapley value for LTIME (Fig.6) showed that when the duration of continuous rainfall was less than 55 hours, LTIME had a fluctuating impact on the model's output. Although prolonged LTIME exacerbated urban waterlogging risk, it did not exhibit a linear growth effect similar to that of TOTAL_R. However, once the duration exceeded 55 hours, the Shapley values surged dramatically. The study suggests that for LTIME less than 55 hours, the hourly rainfall of most events did not reach the disaster-inducing threshold, thus exerting a less intense pressure on urban waterlogging depth.
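The stage-wise reading of a dependency plot amounts to averaging Shapley values within bands of the feature. The sketch below shows this with rainfall bands at 25, 100, and 125 mm, taken from the thresholds discussed above; the sample values are invented for illustration and the function name is hypothetical.

```python
def binned_dependence(feature_values, shap_values, edges):
    """Average Shapley value within each [edges[k], edges[k+1]) band of the feature."""
    n_bins = len(edges) - 1
    sums = [0.0] * n_bins
    counts = [0] * n_bins
    for x, s in zip(feature_values, shap_values):
        for k in range(n_bins):
            if edges[k] <= x < edges[k + 1]:
                sums[k] += s
                counts[k] += 1
                break
    # None marks a band that contains no samples.
    return [sums[k] / counts[k] if counts[k] else None for k in range(n_bins)]


# Invented TOTAL_R values (mm) and Shapley values, for illustration only.
rain = [10, 20, 60, 90, 110, 140]
shap = [0.1, 0.3, 3.0, 5.0, 6.0, 8.0]
print(binned_dependence(rain, shap, edges=[0, 25, 100, 125, 200]))
```

The same function could be applied to LTIME with a band edge at 55 hours to contrast the two regimes described above.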

(2) Urban surface factors

The feature dependency plot for urban surface factors indicated that PW can lower urban waterlogging risk (Fig.7): a higher proportion of water bodies correlated with stronger mitigation of urban waterlogging, although beyond a certain threshold (12.5%) it exhibited no further impact. When PW was below 1.2%, the stormwater retention capacity of water surfaces was minimal. However, once the proportion exceeded 2.5%, the mitigation effect on urban waterlogging gradually strengthened. According to the research by Wenchao Qi et al., urban lakes can provide substantial buffer areas for runoff during flood seasons, thereby facilitating urban waterlogging management^[⁵¹^]. Hence, preserving natural lakes and strategically constructing urban artificial lakes are beneficial for stormwater management.

The impact of PIS on urban waterlogging unfolded in three stages (Fig.7): 1) as PIS rose from 0 to 15%, the Shapley values remained close to zero, suggesting a slight impact; 2) the impact gradually intensified once PIS exceeded the 15% threshold and peaked at 30%^①; and 3) no further rise was observed beyond 30%. Relevant research indicated that the spatial configuration of impervious surfaces significantly affects urban waterlogging. Reducing PIS can provide sufficient space for vegetation to intercept and for soil to absorb rainwater, thereby lowering peak runoff and mitigating urban waterlogging^[⁵²^].

① When PIS exceeded 30%, its impact on urban waterlogging did not strengthen further. Thus, Fig.7 mainly presents the range of 0 to 30%.

(3) Architectural configuration factors

In terms of architectural configuration, BCD had an impact on urban waterlogging depth (Fig.8), with a noticeable increase in average Shapley value when BCD was within [0.01, 0.02]. Examination of the sample data revealed that these sub-catchment units were primarily covered by hard paving; despite their low DB, the extensive impervious surfaces impeded effective infiltration and retention of rainfall, leading to an increase in transient runoff and thus triggering urban waterlogging. Therefore, the proportion of impervious pavement should be strictly controlled in urban design and renewal practice. A more significant compressive effect on urban waterlogging was observed when BCD exceeded the threshold of 0.08. Meanwhile, DB exhibited clear spatial heterogeneity in its impact on urban waterlogging (Fig.8). Even under the same DB conditions, differences in building layout, height, and morphology can lead to variations in Shapley values. When DB was below 15 buildings per hectare, Shapley values fluctuated within [–0.5, 1.0]; above this DB threshold, Shapley values ranged within [0.0, 2.0], indicating a significant increase in urban waterlogging risk.

3.3.3 Local Feature Interpretability

The contribution analysis of different features to the model output for each sub-catchment unit can provide an in-depth understanding of the impact of spatial heterogeneity on urban waterlogging, and inform the development of adaptive strategies. Based on previous predictions (Fig.4), two sub-catchment units of high waterlogging risk in Shenzhen were selected for individual sample feature analysis, where unit A represents a high-density historical downtown of Shenzhen and unit B represents a newly developed coastal area of the city (Fig.9)^②.

② Since the contribution value of the features at a lower rank is too small and not typical for further discussion, the local feature interpretability analysis focused on the top 7 contributing features.

For unit A, the features of TOTAL_R, BCR, PIS, and MBV accounted for the top 80% of the contribution. Consistent with the results of the previous global feature importance analysis, TOTAL_R was the most significant factor affecting urban waterlogging risk in the unit; the other three were architectural configuration factors. Located in the old city area, unit A had an extremely high DB with a complex road system and a large coverage of impervious surfaces, which aggravated runoff during extreme rainfall events.

In unit B, the features contributing to the top 80% of the importance were TOTAL_R, MBV, BCR, PIS, and LTIME. Additionally, compared with unit A, the contribution of PGS increased noticeably, possibly owing to differences in PGS between the two units. This could be attributed to the location of unit B in the coastal commercial and high-end residential areas, where, despite the densely distributed buildings, coastal greenbelts, wetland parks, and roadside green space together raised the contribution of PGS in the model, exerting a mitigating effect on urban waterlogging.
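Selecting the features that make up the top 80% of a unit's contribution can be sketched as a cumulative ranking over absolute Shapley values. The values below are invented to mimic the ordering reported for unit A and are not results from the study; the function name and the choice to normalize by the sum of absolute values are assumptions.

```python
def top_contributors(shap_by_feature, share=0.8):
    """Features, ranked by |SHAP|, whose cumulative share first reaches `share`."""
    ranked = sorted(shap_by_feature.items(), key=lambda kv: abs(kv[1]), reverse=True)
    total = sum(abs(v) for _, v in ranked)
    picked, running = [], 0.0
    for name, value in ranked:
        picked.append(name)
        running += abs(value)
        if running >= share * total:
            break
    return picked


# Hypothetical local Shapley values for one sub-catchment unit.
unit_a = {"TOTAL_R": 4.0, "BCR": 2.0, "PIS": 1.5, "MBV": 1.0,
          "LTIME": 0.8, "PGS": -0.5, "PW": -0.2}
print(top_contributors(unit_a))
```

Negative entries, such as PGS and PW here, count toward the ranking by magnitude, so a strongly mitigating feature can appear among the top contributors.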

4 Conclusions and Perspectives

Based on the observed data of urban waterlogging records in Shenzhen between 2019 and 2021, this study integrated hydro-meteorological, urban surface, and architectural configuration factors to predict urban waterlogging risk using four machine learning models—LightGBM, RF, SVR, and BPDNN. By comparing the performance of the models, LightGBM was selected as the optimal predictive model for urban waterlogging risk assessment, leading to the following conclusions.

During rainfall events with long recurrence intervals (e.g., once every 20 years), the high risk zones in Shenzhen will emerge predominantly in the old city districts of Nanshan, Futian, and Pingshan. Feature analysis of the models indicated that hydro-meteorological factors (including TOTAL_R and LTIME) were the primary disaster-inducing elements, contributing 40.4% to the model. The impact on urban waterlogging became particularly pronounced when TOTAL_R exceeded 125 mm or LTIME exceeded 55 hours. PW was the only feature showing a mitigating effect on urban waterlogging, and its stormwater regulation and retention capacity was enhanced when it exceeded 2.5%. Features reflecting building density exhibited a significantly positive correlation with urban waterlogging risk once they exceeded certain thresholds. Notably, the interpretability of urban surface factors like ARF was relatively low, which may be related to the minor topographical changes and high imperviousness in Shenzhen's built-up areas. TOTAL_R and DB remained the primary features locally affecting urban waterlogging depth, while in certain regions, blue-green infrastructure played a crucial role in mitigating urban waterlogging.

The LightGBM-based method for predicting urban waterlogging risk proposed and validated in this study is of general significance. The analysis of critical factors affecting urban waterlogging through interpretability algorithms can provide guidance for urban planning and construction. In the development of high-density urban areas, it is necessary to strengthen the renovation of old neighborhoods, restore natural ecosystems, and promptly improve the capacities of drainage and flood prevention infrastructure. It is also necessary to restore and expand water bodies to enlarge the natural regulation space in and around cities, and to construct flood storage and security engineering projects according to relevant standards and plans. Furthermore, it is necessary to create more open green spaces integrating spatial and vertical design in urban construction and renewal; adaptively increase the proportion of permeable pavement; and promote the construction of sponge cities to preserve natural rainwater and flood channels and storage spaces including rivers, lakes, and wetlands, establishing a comprehensive ecological infrastructure system.

Due to the constraints in acquiring urban pipeline network data^[²⁸^], this study employed the road factor (PR) as a proxy. This can reflect the efficiency of urban drainage to a certain degree, but some limitations remain. Utilizing data with higher spatial resolution would reveal urban waterlogging dynamics in greater detail and increase the precision of the identification and analysis of the key factors. Future studies could also integrate hydrological and hydraulic models for more targeted and accurate experiments.

References


[1]	Jha, A. K., Miner, T. W., & Stanton-Geddes, Z. (Eds.). (2013). Building Urban Resilience: Principles, Tools, and Practice. The World Bank.

[2]	Arabameri, A., Saha, S., Chen, W., Roy, J., Pradhan, B., & Bui, D. T. (2020). Flash flood susceptibility modelling using functional tree and hybrid ensemble techniques. Journal of Hydrology, 587, 125007.

[3]	Rafiei-Sardooi, E., Azareh, A., Choubin, B., Mosavi, A. H., & Clague, J. J. (2021). Evaluating urban flood risk using hybrid method of TOPSIS and machine learning. International Journal of Disaster Risk Reduction, 66, 102614.

[4]	Wang, Z., Lai, C., Chen, X., Yang, B., Zhao, S., & Bai, X. (2015). Flood hazard risk assessment model based on random forest. Journal of Hydrology, 527, 1130–1141.

[5]	Chen, J., Huang, G., & Chen, W. (2021). Towards better flood risk management: Assessing flood risk and investigating the potential mechanism based on machine learning models. Journal of Environmental Management, 293, 112810.

[6]	Mei, C., Liu, J., Wang, H., Yang, Z., Ding, X., & Shao, W. (2018). Integrated assessments of green infrastructure for flood mitigation to support robust decision-making for sponge city construction in an urbanized watershed. Science of the Total Environment, 639, 1394–1407.

[7]	Shafizadeh-Moghadam, H., Valavi, R., Shahabi, H., Chapi, K., & Shirzadi, A. (2018). Novel forecasting approaches using combination of machine learning and statistical models for flood susceptibility mapping. Journal of Environmental Management, 217, 1–11.

[8]	Guo, Y., Quan, L., Song, L., & Liang, H. (2022). Construction of rapid early warning and comprehensive analysis models for urban waterlogging based on AutoML and comparison of the other three machine learning algorithms. Journal of Hydrology, 605, 127367.

[9]	Wu, Z., Zhou, Y., Wang, H., & Jiang, Z. (2020). Depth prediction of urban flood under different rainfall return periods based on deep learning and data warehouse. Science of the Total Environment, 716, 137077.

[10]	Gan, M., Pan, S., Chen, Y., Cheng, C., Pan, H., & Zhu, X. (2021). Application of the machine learning LightGBM model to the prediction of the water levels of the lower Columbia River. Journal of Marine Science and Engineering, 9(5), 496.

[11]	Breiman, L. (2001). Random forests. Machine Learning, 45, 5–32.

[12]	Panahi, M., Dodangeh, E., Rezaie, F., Khosravi, K., Van Le, H., Lee, M.-J., Lee, S., & Pham, B. T. (2021). Flood spatial prediction modeling using a hybrid of meta-optimization and support vector regression modeling. CATENA, 199, 105114.

[13]	Community Construction and Zoning Office, Bureau of Civil Affairs of Shenzhen Municipality. (2024, April 3). Overview of administrative division information.

[14]	Statistics Bureau of Shenzhen Municipality. (2023, May 8). Shenzhen statistical bulletin on 2022 national economic and social development.

[15]	Zhou, S., Liu, Z., Wang, M., Gan, W., Zhao, Z., & Wu, Z. (2022). Impacts of building configurations on urban stormwater management at a block scale using XGBoost. Sustainable Cities and Society, 87, 104235.

[16]	Meteorological Bureau of Shenzhen Municipality. (2024, May 15). Climatic profile and seasonal characteristics of Shenzhen.

[17]	Ke, Q., Tian, X., Bricker, J., Tian, Z., Guan, G., Cai, H., Huang, X., Yang, H., & Liu, J. (2020). Urban pluvial flooding prediction by machine learning approaches—A case study of Shenzhen City, China. Advances in Water Resources, 145, 103719.

[18]	Hou, J., Guo, K., Wang, Z., Jing, H., & Li, D. (2017). Numerical simulation of design storm pattern effects on urban flood inundation. Advances in Water Science, 28(6), 820–828.

[19]	Zhou, H., Liu, J., Gao, C., & Ou, S. (2018). Analysis of current situation and problems of urban waterlogging control in China. Journal of Catastrophology, 33(3), 147–151.

[20]	Song, L., & Xu, Z. (2019). Coupled hydrologic-hydrodynamic model for urban rainstorm water logging simulation: Recent advances. Journal of Beijing Normal University (Natural Science), 55(5), 581–587.

[21]	Wu, J., & Zhang, P. (2017). The effect of urban landscape pattern on urban waterlogging. Acta Geographica Sinica, 72(3), 444–456.

[22]	Xu, Y., Li, K., Xie, Y., Ling, H., Qian, M., Wang, X., & Lu, Y. (2018). Study on the influencing factors and multiple regression model of urban waterlogging based on GIS—A case study of Shanghai. Journal of Fudan University (Natural Science), 57(2), 182–198.

[23]	Xu, H., Lu, H., Zhan, X., Li, J., Gao, C., & Zhang, T. (2024). Impacts of underlying surface changes and rainfall patterns on flooding at airport area in Zhuhai. China Rural Water and Hydropower, 1–16.

[24]	Shrestha, R., Di, L., Eugene, G. Y., Kang, L., Shao, Y.-Z., & Bai, Y.-Q. (2017). Regression model to estimate flood impact on corn yield using MODIS NDVI and USDA cropland data layer. Journal of Integrative Agriculture, 16(2), 398–407.

[25]	Lin, J., He, X., Lu, S., Liu, D., & He, P. (2021). Investigating the influence of three-dimensional building configuration on urban pluvial flooding using random forest algorithm. Environmental Research, 196, 110438.

[26]	Kim, Y., Eisenberg, D. A., Bondank, E. N., Chester, M. V., Mascaro, G., & Underwood, B. S. (2017). Fail-safe and safe-to-fail adaptation: Decision-making for urban flooding under climate change. Climatic Change, 145, 397–412.

[27]	Wang, J., Yu, C. W., & Cao, S.-J. (2022). Urban development in the context of extreme flooding events. Indoor and Built Environment, 31(1), 3–6.

[28]	Wang, M., Li, Y., Yuan, H., Zhou, S., Wang, Y., Ikram, R. M. A., & Li, J. (2023). An XGBoost-SHAP approach to quantifying morphological impact on urban flooding susceptibility. Ecological Indicators, 156, 111137.

[29]	Yan, M., Yang, J., Ni, X., Liu, K., Wang, Y., & Xu, F. (2024). Urban waterlogging susceptibility assessment based on hybrid ensemble machine learning models: A case study in the metropolitan area in Beijing, China. Journal of Hydrology, 630, 130695.

[30]	Zhang, H., Zhang, J., Fang, H., & Yang, F. (2022). Urban flooding response to rainstorm scenarios under different return period types. Sustainable Cities and Society, 87, 104184.

[31]	Kumar, R., & Acharya, P. (2016). Flood hazard and risk assessment of 2014 floods in Kashmir Valley: A space-based multisensor approach. Natural Hazards, 84, 437–464.

[32]	Jiang, F., Xie, Z., Xu, J., Yang, S., Zheng, D., Liang, Y., Hou, Z., & Wang, J. (2023). Spatial and component analysis of urban flood resiliency of Kunming City in China. International Journal of Disaster Risk Reduction, 93, 103759.

[33]	Xu, Y., Liu, M., Hu, Y., Li, C., & Xiong, Z. (2019). Analysis of three-dimensional space expansion characteristics in old industrial area renewal using GIS and Barista: A case study of Tiexi District, Shenyang, China. Sustainability, 11(7), 1860.

[34]	Cheng, C., Yu, X., Guo, S., & Ma, T. (2005). Analysis of the crowd degree of building for communities based on high spatial resolution remote sensed images. Acta Scientiarum Naturalium Universitatis Pekinensis, 41(6), 875–881.

[35]	Meteorological Bureau of Shenzhen Municipality. (2023, March 30). New version of the storm intensity formula.

[36]	Dai, Y., Wang, Z., Dai, L., Cao, Q., & Wang, T. (2017). Application of Chicago Hyetograph Method in design of short duration rainstorm patterns. Journal of Arid Meteorology, 35(6), 1061–1069.

[37]	Wu, Z., Qiao, R., Zhao, S., Liu, X., Gao, S., Liu, Z., Ao, X., Zhou, S., Wang, Z., & Jiang, Q. (2022). Nonlinear forces in urban thermal environment using Bayesian optimization-based ensemble learning. Science of the Total Environment, 838, 156348.

[38]	Ke, G., Meng, Q., Finley, T., Wang, T., Chen, W., Ma, W., Ye, Q., & Liu, T.-Y. (2017). LightGBM: A Highly Efficient Gradient Boosting Decision Tree. In U. von Luxburg, I. Guyon, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, & R. Garnett (Eds.), Advances in Neural Information Processing Systems 30. Neural Information Processing Systems Foundation, Inc.

[39]	Belgiu, M., & Drăguţ, L. (2016). Random forest in remote sensing: A review of applications and future directions. ISPRS Journal of Photogrammetry and Remote Sensing, 114, 24–31.

[40]	Wu, J., Liu, H., Wei, G., Song, T., Zhang, C., & Zhou, H. (2019). Flash flood forecasting using support vector regression model in a small mountainous catchment. Water, 11(7), 1327.

[41]	Wu, J., Liu, Z., Liu, T., Liu, W., Liu, W., & Luo, H. (2023). Assessing urban pluvial waterlogging resilience based on sewer congestion risk and climate change impacts. Journal of Hydrology, 626, 130230.

[42]	Goodwin, P., & Lawton, R. (1999). On the asymmetry of the symmetric MAPE. International Journal of Forecasting, 15(4), 405–408.

[43]	Hodson, T. O. (2022). Root mean square error (RMSE) or mean absolute error (MAE): When to use them or not. Geoscientific Model Development, 15(14), 5481–5487.

[44]	Hassija, V., Chamola, V., Mahapatra, A., Singal, A., Goel, D., Huang, K., Scardapane, S., Spinelli, I., Mahmud, M., & Hussain, A. (2023). Interpreting black-box models: A review on explainable artificial intelligence. Cognitive Computation, 16, 45–74.

[45]	Li, Z. (2022). Extracting spatial effects from machine learning model using local interpretation method: An example of SHAP and XGBoost. Computers, Environment and Urban Systems, 96, 101845.

[46]	Parsa, A. B., Movahedi, A., Taghipour, H., Derrible, S., & Mohammadian, A. K. (2020). Toward safer highways, application of XGBoost and SHAP for real-time accident detection and feature analysis. Accident Analysis & Prevention, 136, 105405.

[47]	Van den Broeck, G., Lykov, A., Schleich, M., & Suciu, D. (2022). On the tractability of SHAP explanations. Journal of Artificial Intelligence Research, 74, 851–886.

[48]	Molnar, C. (2020). Interpretable machine learning.

[49]	Casalicchio, G., Molnar, C., & Bischl, B. (2019). Visualizing the Feature Importance for Black Box Models. Machine Learning and Knowledge Discovery in Databases: European Conference, ECML PKDD 2018, Dublin, Ireland, September 10–14, 2018, Proceedings, Part I (pp. 655–670). Springer.

[50]	Michiels, J., Suykens, J., & De Vos, M. (2024). Explaining the model and feature dependencies by decomposition of the Shapley value. Decision Support Systems, 182, 114234.

[51]	Qi, W., Hou, J., Liu, J., Han, H., Guo, K., & Ma, Y. (2018). Lake control on surface runoff causing urban flood inundation. Journal of Hydroelectric Engineering, 37(9), 8–18.

[52]	Poelmans, L., Van Rompaey, A., & Batelaan, O. (2010). Coupling urban expansion models and hydrological models: How important are spatial patterns? Land Use Policy, 27(3), 965–975.