1 Introduction
Urban green spaces—natural or semi-natural land uses within cities—are essential components of landscapes, offering residents a broad range of ecosystem services and opportunities for connection with nature and recreational activities
[1]. Visual perception is one of the significant ways through which people perceive the environment
[2], and assessing the visual quality of urban landscapes—also known as visual landscape assessment—constitutes a major topic of landscape research
[3]. Such assessments provide valuable insights for researchers and governments into urban landscape quality. Traditional visual landscape assessment methods primarily employ scenic beauty estimation
[4] and questionnaire surveys
[5]. Although these methods effectively collect people's preferences for specific landscapes, they have notable limitations: they rely heavily on costly manual judgments of images by experts or respondents, which makes the process labor- and resource-intensive, complex to implement, and limited in data sources
[6] [7]. Furthermore, the complexity of real-world landscapes often makes it challenging to apply findings from generalized studies to new scenarios.
The recent development of artificial intelligence (AI) technologies offers solutions to address these issues. AI has demonstrated tremendous potential in research on intelligent built environments and is widely recognized as highly promising in the fields of sustainable smart cities and landscape planning
[8] [9]. Among these advancements, street view images (SVIs) have emerged as a new form of crowd-sourced data that offers a realistic depiction of urban environments and supports feedback based on genuine human perceptions, which serves as a high-quality data source for evaluating the visual quality of urban built environments
[6] [10]. Related research spans five key areas: landscape design and environmental assessment, thermal environment, neighborhood morphology, environmental perception of neighborhood, and socioeconomic factor analysis
[7].
With the increasing availability of satellites and coverage of street view services, satellite imagery and SVIs have become vital data sources for understanding large-scale urban landscapes. However, these sources also have certain limitations. For instance, satellite imagery cannot reflect perception at human eye level. Currently, most green spaces, communities, and educational institutions in many Chinese cities are not accessible to map service providers for image collection; certain roads also lack street view service coverage. Additionally, the lack of timely updates is a major shortcoming of publicly available data
[11]. Consequently, scholars have attempted to use wearable cameras, drones, or other devices to manually collect images as a supplement or replacement for SVIs. For instance, Yan Li et al. used GoPro cameras mounted on a car to collect street images of Xining City, China, and developed a vacancy estimation model using object detection techniques to infer the storefront vacancy rate
[12]. Similarly, Junjie Luo et al. employed drones to establish an oblique dataset for the river landscape visual evaluation of a section of the Grand Canal in Tianjin, China
[11]. Despite these efforts, no studies have specifically focused on manually collecting images of parks as a substitute for SVIs. Given that certain routes or terrains within parks (e.g., staircases, stepping stones) are unsuitable for collecting images with a GoPro while riding or driving, and that drones cannot replicate human perspectives, further exploration of image collection devices and their application in visual landscape assessment is necessary.
Simultaneously, as quality of life improves, there is a growing demand for high-quality green spaces, requiring urban planners to accurately identify and improve low-quality areas within these spaces. However, because existing research is difficult to apply in practice, relevant assessments rely primarily on designers' personal experience and subjective judgment, often overlooking the public's actual needs and preferences. AI technology has the potential to address this challenge by effectively simulating public perceptions and conducting visual evaluations on environmental images
[6] [7] [10]. Nevertheless, AI-based public perception evaluation methods specific to parks have yet to be developed.
This study aims to establish an intelligent perception framework for urban green spaces based on urban park image collection and deep learning techniques. The goal is to enable rapid, accurate, and comprehensive evaluation of park visual quality and to identify low-quality areas, so as to inform spatial renewal and improvement plans. Specifically, this study focuses on the following questions. How can green space images be collected more conveniently? How can AI-based algorithmic systems be developed to accurately reflect public perceptions and preferences of parks, thereby identifying spaces of low visual quality? And in what ways can the perception evaluation results from such a system support theories related to visual landscape assessment? These explorations aim to promote the development of quantitative and evidence-based research on landscape perception and provide effective decision-making guidance for the renewal of urban green spaces.
2 Materials and Methods
2.1 Study Area
This study focused on Zhujiang Park, located in the Tianhe District of Guangzhou, Guangdong Province, China, which is an ecological park that integrates ecological, recreational, and cultural functions. It features diverse space types for various activities, covering an area of approximately 28 hm². The park is highly popular and serves as a representative green space in the subtropical region of China.
2.2 Technical Framework
First, this study adopted a convenient approach to collecting park images using a panoramic camera, and verified its feasibility with on-site operations. Subsequently, the SegFormer-B5 model trained on the ADE20K dataset was used to automatically identify 150 categories of objects in the collected images, from which four objective evaluation metrics were calculated: green view index (GVI), sky view factor (SVF), road visibility index (RVI), and artificial structure visibility index (ASVI). Additionally, four subjective evaluation metrics (attractiveness, richness, naturalness, and depression) were employed. A public perception dataset was established through pairwise comparisons of the images, classifying each image into high or low values for the four subjective dimensions. The ViT-base-p16 model was trained on this dataset to enable effective prediction of the subjective metrics. Next, the spatial distribution of both objective and subjective evaluation metrics was visualized, enabling the identification of areas associated with low-scoring images. Finally, correlations between objective and subjective metrics were analyzed to provide insights for park renovations (Fig.1).
Fig.1 Technical framework. |
2.3 Data Collection and Processing
The images were collected on July 6, 2023, between 9:00 and 13:00, under clear weather conditions with temperatures around 30 °C. A collector walked along all paths in the park, holding the Insta360 ONE RS panoramic camera at a height of approximately 1.7 m. A handheld GPS sensor (Garmin eTrex 221x) recorded the location of each shooting point. Based on previous experience, the research team captured images at road intersections, turning points, the midpoints between two turning points, and landmarks (e.g., buildings, pavilions, sculptures) to provide comprehensive visual information and ensure high efficiency. In this study, the distance between two adjacent collection points was no more than 40 m (about 50 walking steps). A total of 275 panoramic images were captured, all located along the centerline of the paths (Fig.2).
Fig.2 Image collection points. |
Collected images were processed with Insta360 Studio and all were clear enough for the study. Then, the research team extracted perspectives at 0° and 180° in flat mode, generating 550 images that represent the surroundings at each point. The images were subsequently matched with GPS spatial data using ArcMap 10.6.
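The 40 m spacing rule used during collection can be checked against the GPS log after the walk. The following is a minimal sketch, assuming the log is an ordered list of (latitude, longitude) pairs in walking order; the function names and the sample coordinates are hypothetical and are not taken from the study's data.

```python
import math

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in metres


def haversine_m(p, q):
    """Great-circle distance in metres between two (lat, lon) points given in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*p, *q))
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def gaps_over_limit(points, limit_m=40.0):
    """Return (index, distance) for each consecutive pair spaced more than limit_m apart."""
    flagged = []
    for i in range(len(points) - 1):
        d = haversine_m(points[i], points[i + 1])
        if d > limit_m:
            flagged.append((i, d))
    return flagged


# Hypothetical coordinates for illustration only, not the study's GPS log.
track = [(23.1200, 113.3400), (23.1202, 113.3400), (23.1210, 113.3400)]
flagged = gaps_over_limit(track)
```

The check only compares consecutive points, so on a branching path network it would need to be run per path segment rather than over the whole log at once.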
2.4 Deep Learning-based Image Evaluation Methods
2.4.1 Objective Evaluation Metrics Extraction With Semantic Segmentation Model
Physical elements in the environment (both natural and artificial) significantly influence the visual quality of landscapes and people's aesthetic perceptions. Semantic segmentation technology, a key technique for scene understanding, significantly improves the accuracy of identifying physical elements by pixel-level classification.
This study employed the SegFormer-B5 model
[13], recognized for its high accuracy, to extract objective physical elements. The model consists of a hierarchical Transformer encoder and a lightweight All-MLP decoder. The Transformer encoder extracts image features using a self-attention mechanism to weigh important areas, enhancing segmentation performance. The All-MLP decoder fuses multi-level features and predicts semantic segmentation masks, outputting results through a fully connected layer. The model was trained using the ADE20K dataset
[14], an open dataset for scene understanding released by MIT in 2016, which includes 150 element categories. Testing results show that the SegFormer-B5 model outperforms earlier models such as FCN, PSPNet, and DeepLabV3+
[7], as well as advanced models like FPN and UPerNet on the ADE20K validation set
①.
① Model comparison data are available on the OpenMMLab GitHub page.
From the 150 element categories, this research extracted 13 common visual elements in parks
②, and calculated GVI and SVF drawing from existing visual perception research
[15]~
[17]. Additionally, as Zhujiang Park has numerous roads and artificial structures (e.g., walls, benches, streetlights, and fences), this study introduced RVI and ASVI as metrics
[11] (Tab.1).
② The 13 common visual elements in parks include wall, building, sky, tree, shrub, ground cover, first-class road, second-class road, third-class road, fence, skyscraper, bench, and streetlight.
Tab.1 Objective evaluation metrics |
Dimension | Metric | Definition | Source |
---|---|---|---|
Natural | Green view index (GVI) | Proportion of pixels representing vegetation (tree, shrub, and ground cover) | Refs. [15] [16] |
| Sky view factor (SVF) | Proportion of pixels representing sky | Ref. [17] |
Artificial | Road visibility index (RVI) | Proportion of pixels representing road (first-class, second-class, and third-class roads) | Ref. [11] |
| Artificial structure visibility index (ASVI) | Proportion of pixels representing artifacts (wall, building, fence, skyscraper, bench, and streetlight) | Ref. [11] |
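The four metrics in Tab.1 are all pixel proportions, so they can be computed directly from a segmentation mask. The sketch below assumes the mask is a 2D grid of integer class ids; the id values in `TOY_IDS` are hypothetical placeholders, and the real ADE20K indices should be read from the label map shipped with the segmentation model.

```python
from collections import Counter

# Hypothetical class ids for illustration; NOT the real ADE20K indices.
TOY_IDS = {
    "wall": 0, "building": 1, "sky": 2, "tree": 3, "shrub": 4,
    "ground cover": 5, "first-class road": 6, "second-class road": 7,
    "third-class road": 8, "fence": 9, "skyscraper": 10, "bench": 11,
    "streetlight": 12,
}

# Class groupings follow the definitions in Tab.1.
METRIC_CLASSES = {
    "GVI": ("tree", "shrub", "ground cover"),
    "SVF": ("sky",),
    "RVI": ("first-class road", "second-class road", "third-class road"),
    "ASVI": ("wall", "building", "fence", "skyscraper", "bench", "streetlight"),
}


def view_indices(mask, ids=TOY_IDS):
    """Return each metric as the share of pixels belonging to its class group."""
    counts = Counter(px for row in mask for px in row)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("mask contains no pixels")
    return {
        metric: sum(counts[ids[name]] for name in names) / total
        for metric, names in METRIC_CLASSES.items()
    }


# A 3 x 4 toy mask: 2 sky, 4 tree, 1 shrub, 2 ground cover, 2 road, 1 bench.
mask = [
    [2, 2, 3, 3],
    [3, 3, 4, 5],
    [6, 6, 5, 11],
]
indices = view_indices(mask)
```

Pixels that belong to none of the 13 selected classes still count in the denominator, so the four metrics need not sum to 1 on real images.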
2.4.2 Subjective Perception Score Prediction With Image Classification Model
Traditional studies on subjective landscape perceptions often adopt methods like rating scales, pairwise comparisons, or categorization
[18]. For instance, the Likert five-point scale requires respondents to rate images from 1 to 5. After scores are obtained, deep learning image classification models can learn the relationship between scores and image features, simulating the human process of rating images from 1 to 5 and enabling large-scale, rapid subjective perception evaluation. Existing studies mainly rely on large urban perception datasets. For example, the Place Pulse 2.0 dataset
[19] includes over 110,000 images from 56 cities; over 80,000 online volunteers evaluated the images through pairwise comparisons across various dimensions to generate perception scores. This dataset has been used in subsequent research to train image classification models for subjective perception score prediction
[6]. Relevant studies demonstrate that combining subjective visual surveys, image semantic segmentation, and image classification models can effectively and fairly collect and map street-level perceptions
[6]. Although the dataset lacks park-scene image data that can be directly applied to this study, its construction methods and subjective perception prediction approaches provide a foundation for the subjective evaluation in this study.
(1) Establishing subjective evaluation metrics
Drawing from traditional visual landscape assessment research
[20]~
[24], four subjective evaluation metrics were selected: attractiveness, richness, naturalness, and depression. Attractiveness refers to the degree to which a park scene appeals to individuals, encompassing factors like beauty and uniqueness
[20]. Richness reflects the diversity and complexity of park elements, including species and design elements
[20] [21]. Naturalness represents the balance between human intervention and the natural state in the perceived park environment, informing park maintenance and management strategies
[22]. Depression measures the extent to which a park induces feelings of melancholy or discomfort
[23], often used to assess the impact of urban landscapes on physical and mental health
[24]. Parks that induce high levels of depression may cause visitors discomfort and negatively affect their overall experience.
(2) Collecting pairwise comparison results
Compared with directly obtaining numerical ratings from participants, pairwise comparison is a more effective and accurate way to gather perception data
[19]. First, to ensure the coverage of as many kinds of park scenes as possible, 550 photos were manually screened
③, and those with excessive similarity were excluded, leaving 200 valid photos. Next, the research team developed an online rating system using JavaScript, which dynamically adjusted the displayed images based on user selections and the existing relationships between images to ensure that each photo received enough comparisons and valid ratings. In each comparison, two images were randomly selected from the 200 photos (Fig.3). Participants were asked to choose the image that better matched their perception in response to a question (e.g., "Which scene do you think exhibits more attractiveness/richness/naturalness/depression?"). Each participant performed four experiments, each focusing on a single metric. To avoid fatigue, the number of comparisons in each experiment was limited to approximately 50, and each experiment was kept under 10 min. The experiment involved 35 master's students, primarily majoring in Landscape Architecture at South China University of Technology (12 males and 23 females), none of whom had color blindness or color weakness. The experiment was conducted online over three days (March 3 to 5, 2024). In total, this research yielded 6,702 pairwise comparison results across the four metrics, an average of 1,675.5 results per metric.
③ Manual screening refers to image selection based on the collector's subjective perception and personal experience, without rigid quantitative criteria.
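The source does not describe how the rating system chose which pair to display. One plausible way to keep exposure balanced is to favour the images shown least often, with random tie-breaking. The sketch below is a hypothetical illustration of that idea in Python rather than the system's actual JavaScript logic; `next_pair` and its arguments are assumed names.

```python
import random
from collections import Counter


def next_pair(image_ids, times_shown, seen_pairs, rng):
    """Pick two distinct images, favouring those shown least often so far."""
    order = list(image_ids)
    rng.shuffle(order)                          # random order among equally shown images
    order.sort(key=lambda i: times_shown[i])    # stable sort keeps the shuffled tie order
    for a_pos, a in enumerate(order):
        for b in order[a_pos + 1:]:
            pair = frozenset((a, b))
            if pair not in seen_pairs:          # never repeat an identical pair
                seen_pairs.add(pair)
                times_shown[a] += 1
                times_shown[b] += 1
                return a, b
    raise RuntimeError("every possible pair has already been shown")


rng = random.Random(0)
shown = Counter()      # image id -> number of times displayed
seen = set()           # pairs already displayed
session = [next_pair(range(200), shown, seen, rng) for _ in range(50)]
```

With 200 images and 50 comparisons, this policy shows 100 different images exactly once each before any image is repeated.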
Fig.3 Subjective rating system based on pairwise comparison of images. |
(3) Calculation of subjective evaluation metrics
Drawing on existing research
[25], this research used the "strength of schedule" method to statistically analyze the subjective ratings, obtaining high and low scores for each metric (Fig.4).
Fig.4 Examples of image scoring across the four metrics. |
For the subjective evaluation metric $m$, this research defined the frequency with which a given image $i$ was selected ($W_{i,m}$) and not selected ($L_{i,m}$) as follows:
where $w_{i,m}$ and $l_{i,m}$ represent the number of times the image was chosen and not chosen during comparisons, respectively.
The perception score ($Q_{i,m}$) of each image $i$ for the evaluation metric $m$ can be defined as follows:
where $n_i^w$ and $n_i^l$ represent the total number of times image $i$ was selected and not selected, respectively. To further categorize the image scores $Q_{i,m}$ into low and high values, the research team defined the following binary label $W_{i,m} \in \{0, 1\}$, where 0 represents a low score and 1 represents a high score:
where $\mu_m$ and $\sigma_m$ represent the mean and standard deviation of the perception scores across all data for the evaluation metric $m$, respectively.
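The scoring steps above can be sketched in code under stated assumptions. The sketch follows the strength-of-schedule form commonly given in the Place Pulse literature: $W_{i,m} = w_{i,m} / (w_{i,m} + l_{i,m})$, $L_{i,m} = l_{i,m} / (w_{i,m} + l_{i,m})$, and $Q_{i,m}$ proportional to $W_{i,m}$ plus the mean $W$ of the images that $i$ was chosen over, minus the mean $L$ of the images chosen instead of $i$, plus 1. The scaling factor of 10/3 and the threshold `mu + delta * sigma` with `delta = 0` are assumptions for illustration, and the study's exact equations may differ.

```python
from collections import defaultdict
from statistics import mean, pstdev


def strength_of_schedule(comparisons, scale=10 / 3):
    """comparisons: iterable of (chosen, not_chosen) image ids for one metric.

    Returns {image_id: Q}. The scale of 10/3 is an assumed convention.
    """
    beaten = defaultdict(list)    # image -> opponents it was chosen over
    lost_to = defaultdict(list)   # image -> opponents chosen instead of it
    for chosen, not_chosen in comparisons:
        beaten[chosen].append(not_chosen)
        lost_to[not_chosen].append(chosen)
    images = set(beaten) | set(lost_to)

    # Selection and non-selection frequencies W_i and L_i.
    W = {i: len(beaten[i]) / (len(beaten[i]) + len(lost_to[i])) for i in images}
    L = {i: 1 - W[i] for i in images}

    Q = {}
    for i in images:
        bonus = mean(W[j] for j in beaten[i]) if beaten[i] else 0.0
        penalty = mean(L[j] for j in lost_to[i]) if lost_to[i] else 0.0
        Q[i] = scale * (W[i] + bonus - penalty + 1)
    return Q


def binarize(scores, delta=0.0):
    """Label 1 (high) if Q > mu + delta * sigma, else 0 (low); delta is an assumption."""
    mu = mean(scores.values())
    sigma = pstdev(scores.values())
    return {i: int(q > mu + delta * sigma) for i, q in scores.items()}


# Toy data: "a" was chosen over "b" and "c"; "b" was chosen over "c".
Q = strength_of_schedule([("a", "b"), ("a", "c"), ("b", "c")])
labels = binarize(Q)
```

Crediting an image for the strength of the opponents it was chosen over, and discounting losses against strong opponents, is what distinguishes this score from a raw selection rate.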
(4) Image classification model training
After the above calculations, each of the 200 images was assigned a value of "0" or "1" for each of the four metrics, forming the public perception dataset. The image classification model was trained with these values as labels and the images as inputs. This research employed the ViT-base-p16 model
[26] for image classification. The ViT-base-p16 model divides input images into patches and treats each patch as a sequence element for input into a Transformer model. Using a self-attention mechanism, it weights important areas in the input images, effectively capturing important information. During training, the ViT-base-p16 model was first pre-trained on the large-scale ImageNet-1k dataset to learn general image representations. It was then fine-tuned on the public perception dataset for each metric, resulting in four separate models for predicting attractiveness, richness, naturalness, and depression scores for all park images.
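The patching step can be illustrated with a small sketch. It assumes a single-channel image stored as nested lists and a 224 × 224 input resolution; the resolution is an assumption, since the source only implies the 16-pixel patch size through the model name. A real ViT additionally applies a learned linear projection and position embeddings, which are omitted here.

```python
def patchify(image, patch=16):
    """Split an H x W image (nested lists) into a row-major list of flattened patches."""
    h, w = len(image), len(image[0])
    if h % patch or w % patch:
        raise ValueError("image sides must be multiples of the patch size")
    patches = []
    for top in range(0, h, patch):
        for left in range(0, w, patch):
            patches.append([image[r][c]
                            for r in range(top, top + patch)
                            for c in range(left, left + patch)])
    return patches


# 224 x 224 is an assumed input size; pixel values are synthetic.
image = [[(r * 224 + c) % 256 for c in range(224)] for r in range(224)]
tokens = patchify(image)
```

Under these assumptions each image becomes a sequence of 14 × 14 = 196 patch tokens, each holding 16 × 16 = 256 pixel values per channel.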
The performance of the model was evaluated using five-fold cross-validation. Specifically, the dataset of 200 images was randomly divided into five subsets. In each training iteration, four subsets were designated as the training set, and the remaining subset served as the validation set. The average accuracy across all five iterations was calculated to assess overall performance. The model with the highest accuracy was selected for scoring the subjective metrics. This approach makes full use of the limited sample and gives a more robust estimate of how well the model generalizes to new data.
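The five-fold split described above can be sketched as follows. The helper names are hypothetical, and `dummy_scorer` is a placeholder standing in for fine-tuning the classifier on the training indices and measuring accuracy on the validation indices.

```python
import random
from statistics import mean


def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train_indices, val_indices) for each of k shuffled, disjoint folds."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    for i in range(k):
        val = folds[i]
        train = [j for f_no, fold in enumerate(folds) if f_no != i for j in fold]
        yield train, val


def cross_validate(n_samples, fit_and_score, k=5, seed=0):
    """Run fit_and_score on every fold; return the per-fold scores and their mean."""
    scores = [fit_and_score(train, val)
              for train, val in k_fold_indices(n_samples, k, seed)]
    return scores, mean(scores)


def dummy_scorer(train_idx, val_idx):
    """Placeholder for model training and evaluation; returns the validation share."""
    return len(val_idx) / (len(train_idx) + len(val_idx))


splits = list(k_fold_indices(200))
fold_scores, avg_score = cross_validate(200, dummy_scorer)
```

With 200 images, each iteration trains on 160 images and validates on the remaining 40, and every image is used for validation exactly once.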
2.5 Integrated Evaluation of Subjective and Objective Metrics
The trained SegFormer-B5 and ViT-base-p16 models were employed to calculate both subjective and objective evaluation metrics for all 550 images. For each location, the average values of the two images were taken as the final scores. The scores of these data points were visualized in ArcMap 10.6 to create spatial distribution maps of both objective and subjective metrics, identifying low-scoring areas. Since the data did not conform to a normal distribution, Spearman correlation analysis was applied to examine the relationships among the subjective metrics, the objective metrics, and the major elements that occupy a larger proportion of the images, including vegetation (trees, shrubs, and ground cover) and park paths (first-class, second-class, and third-class roads).
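The per-location averaging and the rank correlation can be sketched with the standard library alone. The sketch computes Spearman's coefficient as the Pearson correlation of average ranks, which handles ties; it does not include the significance test reported in the study, and the sample values are invented for illustration.

```python
import math


def average_ranks(values):
    """Return 1-based ranks, giving tied values the average of their rank positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        shared = (pos + end) / 2 + 1          # mean of ranks pos+1 .. end+1
        for k in range(pos, end + 1):
            ranks[order[k]] = shared
        pos = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation applied to average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Invented (front view, back view) GVI values for four locations.
gvi_views = [(0.95, 0.85), (0.70, 0.70), (0.60, 0.40), (0.20, 0.40)]
gvi = [(front + back) / 2 for front, back in gvi_views]   # one score per location
naturalness = [0.8, 0.6, 0.7, 0.1]                        # invented location scores
rho = spearman(gvi, naturalness)
```

Because the coefficient depends only on rank order, it makes no assumption about the distribution of the scores, which is why it suits data that are not normally distributed.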
3 Results and Discussion
3.1 Results of Objective Evaluation Metrics
Fig.5 shows examples of different landscape elements identified through semantic segmentation using the SegFormer-B5 model. Tab.2 summarizes the results of the four objective evaluation metrics. Specifically, the average GVI was the highest (0.7115, comprising tree coverage of 0.3973, shrub coverage of 0.1691, and ground cover of 0.1450), indicating that vegetation is in excellent condition and is the main constituent of the park's landscape. The low average SVF (0.0737) corresponds to the high vegetation coverage, reflecting the dense tree canopy. The low averages of RVI and ASVI also reveal that the park is dominated by natural landscapes. The median values for these two metrics are close to their averages, indicating a relatively low coverage of roads and artificial structures. Moreover, the low SD suggests consistent internal park structures, contributing to a uniform visitor experience.
Fig.5 Examples of semantic segmentation results. |
Tab.2 Results of objective evaluation metrics |
Objective metric | Minimum | Maximum | Average | Median | SD |
---|---|---|---|---|---|
GVI | 0.0000 | 0.9731 | 0.7115 | 0.7351 | 0.1630 |
SVF | 0.0000 | 0.2815 | 0.0737 | 0.0643 | 0.0557 |
RVI | 0.0000 | 0.4237 | 0.1236 | 0.1011 | 0.0910 |
ASVI | 0.0000 | 0.3787 | 0.0286 | 0.0127 | 0.0462 |
3.2 Training Results of Subjective Evaluation Metric Prediction Model
The distribution of five-fold cross-validation data and model prediction accuracy for the subjective evaluation metrics (Fig.6) shows that, although the model's accuracy fluctuated across metrics, the overall trend was stable. The average prediction accuracies for the test set were 69% (attractiveness), 70.5% (richness), 82% (naturalness), and 68.5% (depression), indicating reasonable reliability.
Fig.6 Results of five-fold cross-validation and model prediction accuracy. |
The statistical results for subjective metrics (Tab.3) showed that naturalness had the highest mean value, indicating that the naturalness of Zhujiang Park was particularly prominent in human perception. This aligns with the semantic segmentation results. Naturalness also had the largest range (0.0443 ~ 0.8855) and the highest SD, reflecting significant spatial heterogeneity in vegetation distribution. Both attractiveness and depression also had relatively high mean values, suggesting that park scenes with high naturalness generally have strong appeal. However, excessively dense vegetation may increase feelings of depression. The SDs of these two metrics were moderate and similar, indicating relatively consistent variability across the sample. In contrast, the distribution of richness was more concentrated, with a lower SD and a narrower range (0.0732 ~ 0.5826), indicating relatively small differences in this metric. The lower mean value suggests that the diversity of visual elements in the park was relatively inadequate. Thus, while naturalness varies greatly across scenes, the richness of visual elements is consistently limited, which highlights a need to enhance landscape diversity in the park.
Tab.3 Statistical results of subjective evaluation metrics |
Subjective metric | Minimum | Maximum | Average | Median | SD |
---|---|---|---|---|---|
Attractiveness | 0.0783 | 0.7540 | 0.4285 | 0.4485 | 0.1476 |
Richness | 0.0732 | 0.5826 | 0.2841 | 0.2724 | 0.0942 |
Naturalness | 0.0443 | 0.8855 | 0.4303 | 0.3776 | 0.2021 |
Depression | 0.1619 | 0.8710 | 0.3821 | 0.4120 | 0.1468 |
3.3 Integrated Evaluation Results of Objective and Subjective Metrics
Overall, the spatial distribution patterns of objective and subjective metrics in Zhujiang Park showed similarities (Fig.7, Fig.8). The lawn area in front of the west entrance of the park (Zone C), characterized by open lawns and short trees with sparse shrubs, has wide paths and a higher SVF, but lower scores in GVI and naturalness, as well as relatively low attractiveness. The Kuailv Lake area in the central part of the park (Zone E), despite low GVI and high SVF, exhibited high attractiveness. This aligns with previous findings that people generally prefer water features
[5]. The scenic forest area in the eastern part of the park (Zone F) had high GVI and naturalness, making it a valuable asset in the bustling city center of Guangzhou. Its winding, undulating paths, combined with a low proportion of roads and artificial structures, contributed to the overall high attractiveness. Some areas in the southwest part of the park have dense vegetation and diverse spatial variations, leading to higher richness. However, the variability of scenes between different points results in varying levels of attractiveness. The service buildings on the eastern side of the park (Zone G) have monotonous facades and low attractiveness, requiring special attention in park management.
Fig.7 Examples of park scenes. |
Fig.8 Spatial distribution results of objective and subjective metrics. |
Spearman correlation analysis (Fig.9) revealed a significant positive correlation between naturalness and attractiveness (
rs = 0.60), indicating that scenes with high naturalness are more favored by people. This finding aligns with previous research showing that visitors prefer environments with abundant vegetation. Such preferences may positively influence park usage frequency and visitor satisfaction
[27]. The proportion of ground cover was significantly negatively correlated with richness (
rs = −0.48), suggesting that an increase in ground cover may reduce the overall richness. In Zhujiang Park, areas with a high proportion of ground cover are primarily located in the western part of the park, characterized by open lawns, leading to lower spatial richness. Naturalness was significantly positively correlated with GVI (
rs = 0.71), proportion of tree (
rs = 0.47), and proportion of shrub (
rs = 0.46). GVI and naturalness represent the objective and subjective ecological environment, respectively, but the perception of naturalness is influenced not only by vegetation proportions but also by other factors, such as the overall composition of green elements and the presence of additional materials in the images (e.g., water, soil, or permeable pavements). Depression shows a significant positive correlation with both naturalness (
rs = 0.64) and proportion of shrub (
rs = 0.65), indicating that dense vegetation may evoke feelings of depression.
Fig.9 Spearman correlation analysis results (* indicates significant correlation at the 0.05 level, ** indicates significant correlation at the 0.01 level). |
Furthermore, SVF, RVI, and ASVI show positive correlations with each other and negative correlations with all four subjective perception metrics as well as GVI. This suggests that increases in the proportions of sky, roads, and artificial structures are associated with decreases in vegetation and naturalness. In Zhujiang Park, areas with higher proportions of sky, roads, buildings, walls, and benches, e.g., the children's play area in the northwest (Zone B) and the lawn area in front of the west entrance (Zone C), tend to have lower vegetation coverage, wide paths, and open spaces. Their attractiveness and naturalness are lower than those of the densely vegetated scenic forest area in the eastern part, though their openness reduces feelings of depression.
4 Conclusions and Prospects
The European Landscape Convention emphasizes that landscapes are a vital public interest deserving recognition and protection
[28]. Understanding how individuals observe and perceive landscapes and incorporating these insights into landscape planning and management is critical. This study adopts advanced image collection and AI technologies to develop a methodological framework for landscape research and practice centered on landscape perception. Overall, this study makes three key contributions, as follows.
1) This research implemented a convenient and efficient workflow by combining urban green space image collection by panoramic camera with advanced semantic segmentation and image classification models for unbiased assessments of park visual quality. This method overcomes limitations in traditional visual assessments, such as inefficiencies in processing large volumes of images or fatigue in multi-scene evaluations, and validates the application of image big data and deep learning in landscape perception research.
2) Traditional visual assessment studies often rely on small-sized image datasets and lack accurate quantification of subjective and objective elements. This research precisely extracted and evaluated objective metrics and predicted scoring on subjective metrics, revealing that the presence of vegetation and water features enhances park attractiveness and stimulates positive perceptions. Conversely, higher proportions of sky, roads, and artificial structures are found to have negative effects.
3) Traditional research findings are often difficult to apply directly to preference prediction in new scenarios. The intelligent method demonstrated in this paper can learn subjective scoring from a subset of scene images and predict scores for other new scenes, helping park managers efficiently identify low-scoring areas. This provides actionable guidance for urban green space renewal, demonstrating significant practical value.
Despite the contributions, this research has certain limitations. The image data and the number of participants were relatively limited, and the study focused exclusively on summer landscapes of Zhujiang Park, which may make it difficult to generalize the findings to parks of other types or in other seasons. Notably, the children's play area in the northwest had lower attractiveness according to the research results, likely due to the preferences of the selected participants (university students), who may find areas characterized by low vegetation and richness less appealing. It also underlines a limitation of prior studies based on street view big data, which train models on generalized public preferences and fail to reflect the various needs of different user groups
[19] [25]. Additionally, panoramic camera images may have distortion, potentially affecting accuracy. Future studies should expand the dataset to include more diverse urban green spaces and seasonal landscapes; gather ratings from a broader range of users to improve green space perception datasets; and pay attention to the necessity of conducting preference surveys across diverse user groups.
Finally, the subjective and objective metrics extracted in this study could be integrated with other data, including park vitality, functional usage, and environmental quality, to further explore the relationships between factors including landscape attractiveness, user behavior patterns, and physical characteristics of a park. Such studies will support urban managers in making systematic decisions, developing more precise strategies, optimizing park functions, and improving urban landscape quality.