Web-based multi-vision platform for earthwork productivity on construction sites using real-time model updating

Jeongbin HWANG, Insoo JEONG, Junghoon KIM, Seokho CHI

Front. Struct. Civ. Eng., 2025, 19(6): 1021–1040. DOI: 10.1007/s11709-025-1197-0
RESEARCH ARTICLE


Abstract

Earthwork productivity analysis is essential for successful construction projects. If productivity analysis results can be accessed anytime and anywhere, then project management can be performed more efficiently. To this end, this paper proposes an earthwork productivity monitoring framework via a real-time scene updating multi-vision platform. The framework consists of four main processes: 1) site-optimized database development; 2) real-time monitoring model updating; 3) multi-vision productivity monitoring; and 4) web-based monitoring platform for Internet-connected devices. The experimental results demonstrated satisfactory performance, with an average macro F1-score of 87.3% for continuous site-specific monitoring, an average accuracy of 86.2% for activity recognition, and the successful operation of multi-vision productivity monitoring through a web-based platform in real time. The findings can help site managers understand real-time earthmoving operations while achieving better construction project and information management.

Keywords

online-active learning / site-customized monitoring / multi-vision monitoring / earthwork productivity analysis / web-based site monitoring platform

Cite this article

Jeongbin HWANG, Insoo JEONG, Junghoon KIM, Seokho CHI. Web-based multi-vision platform for earthwork productivity on construction sites using real-time model updating. Front. Struct. Civ. Eng., 2025, 19(6): 1021–1040. DOI: 10.1007/s11709-025-1197-0


1 Introduction

The construction industry is a key sector impacting national gross domestic product, economic growth, and employment [1–3]. Enhancing productivity in construction projects can offer significant benefits both nationally and corporately. However, productivity growth in the construction industry is slower than in other industries, with global construction productivity growing at 1.0% per year, compared to 2.7% for the global economy and 3.6% for manufacturing [4,5]. Many researchers are thus actively endeavoring to improve construction productivity, particularly in earthwork, which accounts for about 20% of total construction project costs [6,7].

Productivity monitoring in construction involves observing the efficiency of resources on-site, identifying factors inhibiting productivity, and making necessary adjustments [8–11]. This process is essential for successful construction project management and can support various decisions made by site managers. Recently, productivity monitoring has been attempted through various methods, including site visits by managers or the use of diverse sensors, such as radio-frequency identification, global positioning system, and ultra-wide band [12–14]. Among these approaches, vision-based productivity monitoring built on computer vision technology is being extensively researched for practical application in the field due to its easy installation and capacity to explain site conditions [15–18]. Cameras can collect more than just data on the type and location of target construction resources; they can also gather video stream data about the overall scene in which construction operations take place. Hence, many researchers are working to apply such computer vision technology to vision-based productivity monitoring. This approach not only readily yields the location and behavior information of each construction resource but also offers the advantage of analyzing additional information, including environmental conditions and information about non-target construction resources.

Most construction projects span several years, requiring long-term productivity monitoring under ever-changing site conditions. For successful site management, it is essential to be able to access productivity monitoring results anytime and anywhere. However, vision-based monitoring models are trained on images previously taken and gathered at distinct time intervals. As a construction project progresses, the visual features of construction sites (e.g., types of operating heavy equipment) change, which can lead to a drop in the model’s performance. Given this, continuous high-performance monitoring requires the development of a large, high-quality site-specific training image Database (DB) to retrain the model periodically when its performance decreases. Furthermore, it is more efficient to understand widespread on-site situations when recording and collecting video data from multiple cameras simultaneously rather than relying on just one. However, since videos captured by multiple cameras are collected separately, an additional process is required for integrated information management. For comprehensive productivity monitoring of the site, if an object appears in one video, disappears, and is captured again in another video, the system must recognize that they are the same object and integrate the productivity-related information obtained from both videos.

To overcome the existing challenges, the authors propose an earthwork productivity monitoring framework via a real-time scene updating platform with multiple videos that are recorded by different cameras. The proposed framework consists of four main processes: 1) site-optimized DB development; 2) real-time monitoring model updating; 3) multi-vision productivity monitoring; and 4) web-based monitoring platform for Internet-connected devices. The authors developed a multi-vision productivity monitoring framework for construction equipment that can perform various earthmoving operations. This approach allows for the rapid development of site-optimized training image DB and ensures that the monitoring model can recognize when its performance decreases, minimizing user effort. It also integrates vision-based monitoring results from multiple videos and statistically analyzes productivity to automatically detect and propose solutions for productivity inhibitors.

This research makes the following contributions. First, the proposed framework can be applied to any other construction site without relying on particular viewpoints or target resources. Second, the vision-based monitoring model can be updated in real time for dynamic construction sites. Third, the authors integrated vision-based monitoring results from multiple videos by considering the conditions of the construction site. Fourth, real-time monitoring of large and complex construction sites becomes possible, and the results can be monitored from anywhere at any time. Last, this research achieves sustainable, high-performance, long-term site monitoring. The authors focused on heavy equipment (i.e., dump trucks, excavators, dozers, and rollers) that performs various operations in adverse working environments across multiple zones. After this introduction, the paper reviews previous research on vision-based productivity monitoring. It then explains the processes involved in the framework and presents experimental results to validate the framework. Finally, the conclusions drawn from the research are discussed.

2 Related work

Numerous researchers have endeavored to monitor construction sites automatically using deep learning-based computer vision techniques. Object detection is one of the most basic algorithms, and comprehensive efforts have been made to apply it to construction sites. For example, Kim et al. [19] tested the feasibility of a Faster Region-based Convolutional Neural Network (Faster R-CNN) [20] with Residual Neural Network-101 (ResNet-101) [21] to detect various types of construction resources, including workers, heavy equipment, and materials. Similarly, Li et al. [22] and Shin et al. [23] adopted You Only Look Once (YOLO) to detect resources (e.g., rebars and heavy equipment) on construction sites. Nath and Behzadan [24] also applied YOLO on construction sites under different visual conditions. Many other studies have been actively conducted on object tracking. Object tracking incorporates the temporal concept of consecutive images to determine whether objects in previous and subsequent frames are the same or different. For instance, Zhu et al. [25] combined a detector and a tracker to leverage their strengths and mitigate their weaknesses in tracking construction resources. Xiao and Zhu [26] compared 15 tracking methods on sites and indicated that those built on sparse representation were more effective than the others. There were further attempts to increase the performance of object tracking. Jeong et al. [27] enhanced a tracking algorithm with unsupervised clustering-driven post-processing for curtain wall installation. Xiao and Kang [28] proposed a construction machine tracker, which detects construction machines with YOLOv3 in each frame and connects the detection results of two consecutive frames for tracking. Angah and Chen [29] proposed a gradient-based method for tracking multiple workers on construction sites, improving performance through detecting, matching, and re-matching the workers. Yan et al. [30] tracked dump trucks to monitor material arrival delays by integrating a deep CNN, a Kanade-Lucas-Tomasi corner feature tracker, and a hash-based occlusion handling strategy. These studies have proven the feasibility of vision-based tracking methods for monitoring productivity.

The findings of earlier works led researchers to apply vision-based algorithms to recognize construction activities. Luo et al. [31] used three different CNNs that analyzed red-green-blue, optical flow, and gray channels for recognizing construction activities. Kim and Chi [32] and Zhang et al. [33] combined CNN and long short-term memory to generate time-spatial and temporal features of equipment operations. In other studies, Luo et al. [34] integrated relevance networks on CNN to extract semantic relevance representing the likelihood of cooperation or coexistence between different resources for activity recognition. Similarly, Roberts and Golparvar-Fard [35] used a hidden Markov model to understand the temporal sequence of earthmoving activities, and Soltani et al. [36] applied stereo-vision for estimating three-dimensional poses of excavators.

Recently, the majority of construction sites have been equipped with multiple Closed-Circuit Television (CCTV) cameras, enabling comprehensive video monitoring of every working spot on the site. However, vision-based monitoring is limited in analyzing multiple scene videos captured simultaneously. Unlike humans, when a resource captured by one CCTV camera appears in the video of another, the monitoring model cannot inherently determine whether the two resources are the same one. Several studies have been conducted to empower vision-based monitoring to discern this information automatically. For example, Cheng et al. [37] designed a similarity loss function to encourage deep learning models to learn discriminative human features for robust tracking of individual workers. Zhang et al. [38] also applied a worker re-identification algorithm that adopts a distance-metric-based deep model. Wei et al. [39] extracted feature maps using a spatial attention network. Existing computer vision techniques exhibit high performance only for images with visual characteristics similar to the training data. However, as a construction project progresses over time, the visual characteristics of the site background change, leading to a gradual degradation in the performance of vision-based models. Additionally, to monitor a construction site successfully, it is necessary to analyze videos collected from multiple cameras simultaneously and to track each construction resource operating at diverse spots. For this, an optimized vision-based monitoring model is required for each camera, and these models must be capable of real-time updates. This paper resolves this by applying a baseline DB and an online-active learning approach, thereby implementing real-time updating of vision-based models.

Lastly, construction productivity can be defined in various ways, such as the amount of work completed per hour, or the time and cost required for a specific task. While its definition may vary depending on context, productivity is generally considered high when a task is completed in less time or at a lower cost. Most studies analyzing site productivity through vision-based monitoring estimate productivity based on task duration, defining it as higher when less time is required. Accordingly, productivity analysis is conducted by estimating the time each resource spends on a task. Kim et al. [40] estimated productivity by using CNN-based equipment detection results as inputs to an earthmoving simulation model. Besides, Kim et al. [41] applied deep neural network-based equipment and license plate detection results to track excavators across multiple videos and estimate earthwork productivity. Cheng et al. [42] conducted excavator action recognition to calculate both work time and average cycle time for productivity analysis. However, these studies calculated productivity based on simulation without reflecting the dynamic and complex nature of construction sites. To address the limitation, Bügler et al. [43] analyzed earthwork productivity by not only employing activity recognition but also using photogrammetry to create a point cloud to measure the volume of excavated soil. Additionally, Chen et al. [44] applied zero-shot learning for activity recognition to consider the dynamically changing characteristics of construction sites. However, from the practical application point of view, there is still a lack of studies that monitor productivity from multiple viewpoints covering the entire site and identify productivity inhibitors and their causes. Furthermore, for successful site management, managers need to be able to confirm productivity analysis results anytime and anywhere. This study developed a web-based multi-vision platform accessible from any device, such as a smartphone, tablet PC, or laptop.

3 Proposed framework

Fig.1 illustrates the proposed framework that includes four main processes: 1) site-optimized DB development; 2) real-time monitoring model updating; 3) multi-vision productivity monitoring; and 4) web-based monitoring platform for Internet-connected devices. The details of these processes are described in the following subsections.

3.1 Site-optimized database development

Due to the diverse visual characteristics present at individual construction sites, such as the backgrounds of the sites and the types of existing resources, it is necessary to generate a new training image DB to train a vision-based monitoring model. Constructing a large and high-quality training image DB, however, requires substantial effort. It involves visiting the site in person to install cameras, collecting video stream data, and annotating construction resources. During earthwork operations, the background of the site image continuously changes while the types of resources involved on the site remain consistent. Taking this into account, the authors aim to develop a DB that considers only the changes in the background of the construction site. We utilize real construction site images as the background for the images and develop a site-optimized DB by combining these backgrounds with a pre-generated image DB of construction resources.

To this end, it is beneficial to pre-develop a DB that embodies the visual characteristics of construction resources, named a baseline DB. This research automatically develops and collects a baseline DB using web crawling [45] and virtual reality techniques [46], applying the authors’ previous research findings. After that, the authors developed a site-optimized DB [47], a large and high-quality image DB, by cross-oversampling the baseline DB and previously collected real target-site images. In detail, the resources change only in location or pose, while their other visual characteristics remain unchanged. On the other hand, the site background continuously changes as the construction process progresses. Considering this, the construction resources in the pre-developed baseline DB were resized and synthesized into real site images. The synthesized areas were placed in appropriate locations (e.g., an excavator cannot be in the sky, so it was only synthesized on the ground). Hwang et al. [47] demonstrated that a model trained using this approach can avoid background mismatches. Utilizing the site-optimized DB enables the development of a high-performance vision-based model with only a small number of target site images. Furthermore, since the baseline DB is already established, the time required for data preparation in actual field environments can be minimized.
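For illustration, the cross-oversampling step can be sketched in a few lines of Python; this is a minimal sketch, not the exact implementation of Ref. [47], and the function name, ground-region band, and scale range are illustrative assumptions.

```python
import random
import cv2
import numpy as np

def synthesize_sample(background, cutout, mask, ground_y_range=(0.45, 0.95),
                      scale_range=(0.4, 1.0)):
    """Paste one equipment cutout from the baseline DB onto a real site background.

    background : H x W x 3 site image (BGR)
    cutout     : h x w x 3 equipment image cropped from the baseline DB
    mask       : h x w binary mask of the equipment pixels
    Returns the composited image and a YOLO-style (cx, cy, w, h) label
    normalized to the background size.
    """
    H, W = background.shape[:2]

    # Resize the cutout so it stays plausible relative to the scene.
    s = random.uniform(*scale_range)
    h = max(1, int(cutout.shape[0] * s))
    w = max(1, int(cutout.shape[1] * s))
    cutout = cv2.resize(cutout, (w, h))
    mask = cv2.resize(mask, (w, h), interpolation=cv2.INTER_NEAREST)

    # Place the cutout only on the ground band of the image
    # (e.g., an excavator cannot be synthesized into the sky).
    y_min = int(H * ground_y_range[0])
    y_max = int(H * ground_y_range[1]) - h
    x = random.randint(0, max(0, W - w))
    y = random.randint(y_min, max(y_min, y_max))
    y = min(y, H - h)   # keep the cutout fully inside the frame

    out = background.copy()
    roi = out[y:y + h, x:x + w]
    m = (mask > 0)[..., None]
    out[y:y + h, x:x + w] = np.where(m, cutout, roi)

    label = ((x + w / 2) / W, (y + h / 2) / H, w / W, h / H)
    return out, label
```

Repeating such compositions over many backgrounds and cutouts yields paired images and labels with which the detector can be retrained.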

3.2 Real-time monitoring model updating

Maintaining the high performance of vision-based monitoring models requires a large and high-quality training image DB and retraining the model whenever its performance decreases. To detect performance decreases in real time, practitioners would have to monitor the analysis results and compare them with the ground truths. However, this traditional method demands the same effort as manually monitoring the site, making the use of computer vision techniques meaningless. To solve this problem, the model should be able to determine when its performance decreases and update itself in real time. To achieve this, the authors utilized the concepts of active and online learning. Active learning [48] is a semi-supervised learning technique that selects the most meaningful data from a large amount of training data and incrementally maximizes the performance of the analysis model by learning from it. Specifically, the uncertainty of the model’s predictions for each data point is evaluated, and data with high uncertainty are prioritized for learning [49,50]. The ability of the active learning approach to evaluate uncertainty and determine the need for learning is essential for maintaining model performance [51,52]. Second, online learning is another semi-supervised learning technique [53] that generates training data from actual test data to optimize model training in the test environment. It is suitable for analyzing video stream data because it processes and learns data sequentially. However, relying solely on online learning does not allow vision-based models to determine when training is required.

To address these drawbacks, the authors propose an online-active learning algorithm for real-time monitoring model updating. The model analyzes incoming video frames in real time and evaluates its uncertainty. The model is maintained as long as the uncertainty does not exceed the threshold; once it does, the model is judged to need an update. The uncertainty is defined as one minus the average confidence score of the last ten frames. The performance evaluation of object detection models is commonly conducted by comparing the predicted values with a confidence score of 0.5 or higher to the ground truth values. A confidence score of 0.5 or higher means that there is more than a 50% probability of being correct, and thus the model considers these predictions as “correct”; this is why only confidence scores of 0.5 or higher are used for calculating uncertainty. When evaluating the performance of a vision-based model, the predicted values are normally compared with ground truth values that are manually generated by humans. In online-active learning, however, performance degradation must be assessed in real time, making it impossible to generate ground truth values, so the existing evaluation method cannot be used to determine the need for model updates. Instead, uncertainty is defined based on the confidence scores of the model’s own predictions and is used to estimate model performance and decide whether an update is necessary (Eq. (1)). Consequently, lowering the uncertainty threshold raises the average confidence score the model must maintain, causing the online-active learning loop to operate more frequently and improving model performance. In statistical analysis, confidence intervals are commonly used at levels of 0.9 or higher, reflecting a general acceptance of 0.9 as a high standard. Furthermore, the authors set this high standard of 0.9 to ensure the performance of the model because the framework must rely solely on the predictions generated by itself. For this reason, the uncertainty threshold was set at 0.1, which demonstrated sufficient training frequency and performance in this paper. This study applied YOLOv5 [54], one of the most powerful object detection algorithms, with 10 training epochs per update for the object detection module. When the model needs to be updated, a training image DB is required; therefore, the authors applied the large and high-quality site-optimized DB described above.

\text{Uncertainty} = 1 - \frac{1}{10}\sum_{i=1}^{10}\text{ConfidenceScore}_i . \qquad (1)
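A minimal sketch of the update-trigger logic behind Eq. (1) is shown below; it assumes the detector returns the confidence scores of each frame's detections, and the class and method names are hypothetical.

```python
from collections import deque

CONF_FLOOR = 0.5            # only predictions with score >= 0.5 are counted
UNCERTAINTY_THRESHOLD = 0.1  # i.e., the average confidence must stay above 0.9

class UpdateTrigger:
    """Decide when the detector needs retraining, following Eq. (1)."""

    def __init__(self, window=10):
        self.frame_scores = deque(maxlen=window)   # one mean score per frame

    def observe(self, detection_scores):
        """detection_scores: confidence scores of one frame's detections."""
        kept = [s for s in detection_scores if s >= CONF_FLOOR]
        # If nothing confident was detected, treat the frame as fully uncertain.
        self.frame_scores.append(sum(kept) / len(kept) if kept else 0.0)

    def uncertainty(self):
        if not self.frame_scores:
            return 1.0
        return 1.0 - sum(self.frame_scores) / len(self.frame_scores)

    def needs_update(self):
        # Full ten-frame window observed and average confidence dropped below 0.9.
        return (len(self.frame_scores) == self.frame_scores.maxlen
                and self.uncertainty() > UNCERTAINTY_THRESHOLD)
```

When needs_update() returns True, a fresh site-optimized DB would be generated from recent frames and the YOLOv5 model fine-tuned for 10 epochs, as described above.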

3.3 Multi-vision productivity monitoring

To monitor earthwork productivity at construction sites, it is necessary to recognize the activity of resources in image frames, perform single-vision productivity analysis, and integrate the multiple single-vision productivity analysis results. For this purpose, this section covers action recognition, single-vision productivity analysis, and the integration of multiple productivity analyses. For action recognition, an object tracking algorithm is used to assign IDs to each construction resource. This research applies a centroid tracker that follows the centroids of the bounding boxes as they move and change over time. Based on the tracking results, the authors identified individual actions and interactions to recognize activities (Fig.2). First, the individual actions of each construction resource are determined. Individual actions refer to movements without assigned semantic meaning and are categorized into move, semi-stop, and stop. “Move” refers to changing location, “semi-stop” represents movement in place without a change of location, and “stop” refers to a state of no movement. These individual actions are determined based on the presence or absence of movement of the construction resource’s centroid and changes in the aspect ratio of the bounding box. When the resource’s centroid moves, it is classified as “move”. When the centroid does not move but the aspect ratio changes, it is determined as “semi-stop”. When neither the centroid moves nor the aspect ratio changes, it is classified as “stop”. Specifically, the centroid movement is calculated as the ratio between the change in the center of the bounding box and the length of its diagonal. The presence of any change in the aspect ratio is determined using the aspect-ratio change rate, with the threshold set to 0.1 per second for all measurements. Second, based on the distance between resources and their actions, the interaction between resources is determined. For example, loading refers to an excavator loading soil onto a dump truck. In this case, a dump truck is near the excavator, and the excavator continuously moves its arm and body in a “semi-stop” state; from the dump truck’s perspective, the excavator is nearby and the truck itself remains in a “stop” state. Since activities vary depending on the type of construction resource, rules must be set for each resource to be analyzed. The presence of interaction is determined by comparing the distance between the centers of the bounding boxes with half the sum of their diagonals. The activities can be classified into three different production states: productive, semi-productive, and non-productive. Productive activities directly impact productivity (e.g., load), semi-productive activities indirectly affect productivity (e.g., travel), and non-productive activities have no impact on productivity (e.g., idle). This research calls this a standard classification system (Tab.1). For example, the activity types of excavators are categorized as idle, travel, soil-manage, and load. Notably, during the loading operation, the excavator interacts with a dump truck in a “stop” state. The excavator is likewise in a “semi-stop” state during the soil-managing operation, but unlike loading, soil managing does not involve any interaction with other construction resources. Additionally, the excavator exhibits individual actions such as stop, move, and semi-stop while performing each operation. The semi-stop action refers to the state where the excavator moves its arm or body in place for soil management or loading (Fig.3).
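The individual-action and interaction rules above can be written as the following minimal sketch; the 0.1 per-second threshold is applied to both measurements as stated above, while the function names and the excavator rule at the end are simplified, hypothetical renderings of Tab.1.

```python
import math

MOVE_THRESHOLD = 0.1            # centroid shift per second, relative to the diagonal
ASPECT_CHANGE_THRESHOLD = 0.1   # relative aspect-ratio change per second

def _centroid(box):
    x1, y1, x2, y2 = box
    return ((x1 + x2) / 2, (y1 + y2) / 2)

def _diagonal(box):
    x1, y1, x2, y2 = box
    return math.hypot(x2 - x1, y2 - y1)

def individual_action(box_prev, box_curr):
    """Classify one tracked resource as move / semi-stop / stop
    from two bounding boxes taken one second apart."""
    (cx0, cy0), (cx1, cy1) = _centroid(box_prev), _centroid(box_curr)

    # Centroid movement relative to the bounding-box diagonal.
    movement = math.hypot(cx1 - cx0, cy1 - cy0) / _diagonal(box_curr)
    if movement > MOVE_THRESHOLD:
        return "move"

    # In place, but the silhouette (aspect ratio) keeps changing.
    ar_prev = (box_prev[2] - box_prev[0]) / (box_prev[3] - box_prev[1])
    ar_curr = (box_curr[2] - box_curr[0]) / (box_curr[3] - box_curr[1])
    if abs(ar_curr - ar_prev) / ar_prev > ASPECT_CHANGE_THRESHOLD:
        return "semi-stop"
    return "stop"

def interacting(box_a, box_b):
    """Two resources interact when their centers are closer than
    half the sum of their bounding-box diagonals."""
    (ax, ay), (bx, by) = _centroid(box_a), _centroid(box_b)
    dist = math.hypot(bx - ax, by - ay)
    return dist < 0.5 * (_diagonal(box_a) + _diagonal(box_b))

def excavator_activity(excavator_action, truck_action, interaction):
    """Simplified excavator row of the standard classification system (Tab.1)."""
    if excavator_action == "semi-stop" and interaction and truck_action == "stop":
        return "load"
    if excavator_action == "semi-stop":
        return "soil-manage"
    if excavator_action == "move":
        return "travel"
    return "idle"
```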

Earthwork productivity can be defined in a variety of ways. Among them, the authors defined earthwork productivity as (a) the total earthwork volume and (b) the productive work time ratio of each resource. The total earthwork volume is calculated by multiplying the number of loaded dump trucks by their capacity. The productive work time ratio quantifies the duration each resource spends on various tasks. For single-vision productivity analysis, the productive work time of each resource in each video is comparatively analyzed. The productive work time ratio is defined as the proportion of ‘productive’ (load and unload) and ‘semi-productive’ (travel) activities within the total working time. Dump trucks with significantly longer durations for travel, load, or unload activities are scrutinized, as such patterns could indicate reduced productivity. These resources are considered inhibiting factors, and statistical analysis using Z-score normalization is conducted to detect them. Resources with a Z-score exceeding the 95% confidence threshold (1.96) are deemed outliers and assumed to be inhibitors.
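A minimal sketch of this Z-score screening is shown below; the duration values in the usage example are purely illustrative.

```python
import statistics

Z_THRESHOLD = 1.96   # 95% confidence level

def find_inhibitors(durations):
    """durations: {truck_id: seconds spent on one activity, e.g., 'travel'}.
    Returns IDs whose duration is an upper outlier under Z-score normalization."""
    values = list(durations.values())
    mean = statistics.mean(values)
    stdev = statistics.pstdev(values)
    if stdev == 0:
        return []
    return [rid for rid, v in durations.items() if (v - mean) / stdev > Z_THRESHOLD]

# Illustrative numbers only: the truck with an unusually long travel time is flagged.
travel_time = {"#3": 900, "#5": 310, "#12": 295, "#17": 330, "#29": 320, "#21": 305}
print(find_inhibitors(travel_time))   # -> ['#3'] with these illustrative values
```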

Multi-vision productivity monitoring requires the integration of multiple productivity analysis results. The authors applied rule-based object re-identification, which consists of the Line of Interest (LoI) and buffer algorithm [55]. The LoI method determines whether a resource has left the field of view, while the buffer algorithm stores the type and ID of resources transitioning between zones. The LoI involves creating a virtual line in the captured video (Fig.4). If a construction resource crosses that line, for example by moving out of the video frame, it is determined that the resource has exited the zone. Conversely, if a resource crosses the virtual line in the opposite direction, it is concluded that the resource has entered the zone. Following this principle, if the last centroid coordinate of a tracked resource crosses the virtual line, the object is considered to have left the zone and to be moving to another location. The virtual lines are set differently based on the characteristics of the viewpoints. Once the camera is installed, the recording zone remains essentially unchanged, ensuring that these virtual lines stay fixed until filming concludes.

The buffers are defined to serve as links between actual zones. They function as an algorithm that temporarily stores the types and IDs of target construction resources while they transition from exiting one zone to entering another. This helps preserve the IDs of resources that disappear from all video frames and aids in predicting which area (blind spot) a resource is located in during the transition (Fig.5). For instance, when a construction resource exits zone A, buffer-AB stores the resource. After a while, if a construction resource of the same type as the one stored in buffer-AB enters zone B from zone A, buffer-AB removes the stored resource and reassigns its ID for object re-identification. In detail, minimum and maximum durations for holding a resource in the buffer are set, because resources need a minimum time to travel between zones and periodically refreshing the buffer reduces tracking errors.
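The LoI check and the zone buffer can be sketched as follows, assuming tracked centroid coordinates are available; the class and function names are hypothetical, and the 160-330 s window corresponds to the view 2 to view 3 transition reported in Section 4.3.

```python
import time

def crossed_loi(prev_centroid, curr_centroid, line):
    """Line of Interest check: the virtual line is ((x1, y1), (x2, y2)).
    A resource has crossed when its centroid switches sides of the line."""
    (x1, y1), (x2, y2) = line

    def side(p):
        return (x2 - x1) * (p[1] - y1) - (y2 - y1) * (p[0] - x1)

    return side(prev_centroid) * side(curr_centroid) < 0

class ZoneBuffer:
    """Temporarily holds resources travelling between two camera zones."""

    def __init__(self, min_travel_s, max_travel_s):
        self.min_travel_s = min_travel_s
        self.max_travel_s = max_travel_s
        self.items = []   # (resource_type, global_id, exit_time)

    def store(self, resource_type, global_id):
        self.items.append((resource_type, global_id, time.time()))

    def match(self, resource_type):
        """Return the stored ID of the same type whose travel time is plausible,
        or None if the new arrival cannot be re-identified."""
        now = time.time()
        # Refresh: drop entries that waited longer than the maximum travel time.
        self.items = [it for it in self.items if now - it[2] <= self.max_travel_s]
        for i, (rtype, gid, t_exit) in enumerate(self.items):
            if rtype == resource_type and now - t_exit >= self.min_travel_s:
                del self.items[i]
                return gid
        return None

# Buffer between the travelling zone (view 2) and the loading zone (view 3),
# using the 160-330 s travel window reported in Section 4.3.
buffer_23 = ZoneBuffer(min_travel_s=160, max_travel_s=330)
```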

3.4 Web-based monitoring platform for Internet-connected devices

If the results of site monitoring can be observed from anywhere at any time, site managers can make more proactive decisions for successful site management. This section describes the development of a web-based monitoring platform that displays the multi-vision monitoring results on various Internet-connected devices without spatiotemporal constraints. Accordingly, a web-based platform user interface is developed to display both real-time multi-vision monitoring and productivity monitoring results (Fig.6). The platform is developed using Amazon Web Service (AWS) Elastic Compute Cloud (EC2) for multi-vision monitoring, AWS Simple Storage Service (S3) for storing results, and AWS Lightsail for the other basic functions.

For the real-time multi-vision monitoring results, the following processes must be performed in real time: (a) downloading each frame from the video stream data; (b) analyzing each frame without delay; (c) uploading the analysis result to the back-end server; and (d) converting the analyzed frames into video stream data for platform users. To perform process (b) in real time, different types of AWS EC2 instances were validated for their suitability with object detection algorithms such as Faster R-CNN and YOLOv5 (Tab.2). According to Hwang et al. [47], the performance difference between Faster R-CNN and YOLOv5 is minimal when trained with a site-optimized DB. Therefore, the model selection was based on analysis time and instance cost. The most affordable option, g4dn.xlarge with YOLOv5, was chosen for its balance between analysis time and cost. To perform processes (a) and (c) in real time, AWS S3 buckets are used for storing the numerous analyzed image frames and serving as the platform’s back-end storage. Three separate codes operate on the back-end of the AWS EC2 instance to minimize delays in downloading, analyzing, and uploading, while a fourth code, which displays image frames on the platform, operates on the back-end of the AWS Lightsail instance.
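As an illustration of one of these separate back-end codes, the following sketch shows an uploader process that pushes analyzed frames to the S3 bucket; it assumes the boto3 SDK, and the folder and bucket names are hypothetical.

```python
import os
import time
import boto3

s3 = boto3.client("s3")
RESULT_DIR = "/tmp/analyzed_frames"   # hypothetical local folder written by the detector
BUCKET = "multi-vision-results"       # hypothetical S3 bucket backing the web platform

def upload_worker(view_id, poll_interval=0.05):
    """Runs as its own process on the EC2 instance so that uploading never blocks
    the detector: it watches the result folder and pushes each new analyzed frame
    to the S3 bucket from which the platform front-end fetches images."""
    seen = set()
    while True:
        for name in sorted(os.listdir(RESULT_DIR)):
            if name in seen or not name.endswith(".jpg"):
                continue
            key = f"view{view_id}/{name}"
            s3.upload_file(os.path.join(RESULT_DIR, name), BUCKET, key)
            seen.add(name)
        time.sleep(poll_interval)
```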

The user interface utilizes the uploaded video data for productivity monitoring. The productivity monitoring results are uploaded and displayed automatically for the users. For real-time multi-vision monitoring, the system fetches the most recent image from the AWS S3 bucket every 0.1 s. The platform’s user interface comprises a front-end and a back-end. The front-end, directly shown to the user, is developed using the TypeScript language with frameworks such as Next.js and React.js and the ant.design style library. The back-end, responsible for server processes, uses TypeScript with the Express.js framework, Object-Relational Mapping (ORM) with TypeORM, a PostgreSQL DB, JsonWebToken for authentication, and the representational state transfer approach for the application programming interface.

For real-time monitoring, the time requirement of each process in the back-end is specifically validated and predicted. The video captured on the site needs to be downloaded in real time by AWS EC2 instances, the results of the multi-vision monitoring must be uploaded to an AWS S3 bucket, and the uploaded images must then be retrieved on the web-based platform. Initially, the authors estimated the time required to fetch a video captured on-site. An image with a 1920 × 1080 resolution consists of a total of 2073600 pixels, with each pixel containing information for red, green, and blue. Each color can have a value ranging from 0 to 255, requiring 1 byte of storage space per color. Therefore, an image of 1920 × 1080 resolution requires a storage size of 6220800 bytes, equivalent to 5.93 MB. Hence, for example, to analyze one video at three frames per second (fps), the on-site camera must upload data at a rate of 17.80 MB per second. Considering three video streams, an upload speed of 53.40 MB per second is required. However, as of 2023, although South Korea is one of the world’s internet powerhouses, its 5G upload speed is only 42.0 Mbit/s, equating to an upload capacity of about 5.25 MB/s [56]. Therefore, it is practically impossible to fetch high-definition videos captured by CCTV cameras installed at construction sites in real time using wireless internet (e.g., 5G), and the research considered only wired internet. While wired internet services range from 500 Mbit/s to 10 Gbit/s, the authors assumed a 500 Mbit/s service as a cost-effective solution. A 500 Mbit/s wired internet service provides an upload speed of 62.50 MB/s. This speed is sufficient to receive three videos at 3 fps in real time (53.40 MB/s < 62.50 MB/s), but not enough to receive four videos in real time.
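The bandwidth arithmetic above can be reproduced in a few lines:

```python
# Back-of-the-envelope check of the upload bandwidth discussed above.
width, height, bytes_per_pixel = 1920, 1080, 3            # raw RGB frame
frame_mb = width * height * bytes_per_pixel / 1024 ** 2    # ~5.93 MB per frame

fps, cameras = 3, 3
required_mb_s = frame_mb * fps * cameras                   # ~53.4 MB/s for three views

# A 500 Mbit/s wired service uploads at most 500 / 8 = 62.5 MB/s.
wired_mb_s = 500 / 8
print(required_mb_s, wired_mb_s, required_mb_s < wired_mb_s)  # 53.39..., 62.5, True
```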

4 Experimental results and discussion

To validate the proposed framework, experiments were conducted on the Sejong-Anseong highway construction site (Section 4) in South Korea on April 12, 2023. Three different videos recorded the soil unloading zone (view 1), the construction resources’ traveling zone (view 2), and the soil loading zone (view 3). For view 1, the camera recorded the unloading zone where dump trucks, dozers, and rollers were operating. Regarding view 2, the camera recorded the traveling zone, a connection zone between view 1 and view 3. View 3 was a video stream in which an excavator was loading soil onto dump trucks. A total of 97080 image frames (view 1: 34520 images, view 2: 32720 images, view 3: 29840 images) were collected from the construction site at 1920 × 1080, 1440 × 1080, and 1280 × 720 resolutions and 3 fps. Since each zone has specific areas where tasks are primarily performed, cameras were strategically installed to collect video footage while ensuring that occlusion between resources within these work areas was minimized. The baseline DB and the site-optimized DB contained information on dump trucks, excavators, dozers, and rollers (e.g., visual characteristics). For continuous multi-vision monitoring in real time, the training image DB development and model training were conducted on AWS EC2 g4dn.xlarge instances (Ubuntu 20.04 LTS) with an additional 500 GB of storage for each view.

4.1 Object detection using real-time model updating

Object detection using real-time model updating combines online-active learning and the site-optimized DB. Before developing the site-optimized DB, a baseline DB is required. The baseline DB, containing 3424 images, was pre-developed using web crawling and virtual reality techniques. The site-optimized DB of 3424 images was then developed in 1100 s (Fig.7). Considering that labeling images typically takes more than 10 s per image [32,48], the experiment demonstrated a time reduction of 96.79%, proving that the site-optimized DB can be used for real-time monitoring model updating.

When real-time model updating with the site-optimized DB was applied to each view, it achieved an average online-active learning operating interval (i.e., time between model updates) of 1114.36 s and an average macro F1-score of 87.30%. A longer interval means the model’s high performance is maintained for a longer time without retraining. The interval and macro F1-score for each view were: view 1, 1643.81 s and 84.81%; view 2, 894.31 s and 86.84%; and view 3, 804.95 s and 90.25%. In contrast, when real-time model updating used only the target site images (without the site-optimized DB), it achieved an average operating interval of 37.29 s and an average macro F1-score of 46.00% (Tab.3). Examples of object detection results with real-time model updating are shown in Fig.8. These results demonstrate that the proposed approach can be applied to various sites and resources, and that the site-optimized DB can maintain the model’s high performance for a longer duration.

4.2 Activity recognition

Activity recognition was carried out through individual action recognition and interaction analysis. When the activity recognition model was applied to each view, it achieved an average accuracy of 86.20%. Specifically, the accuracy for each view was: view 1, 81.47%; view 2, 96.84%; and view 3, 80.28% (Tab.4, Fig.9 and Fig.10). In view 1, the altitude difference between the working area and the camera installation spot was small. This allowed for practical centroid movement analysis when construction resources moved perpendicularly to the camera’s direction; however, it was hard to identify centroid movement when the movement was parallel to the camera’s direction. This made distinguishing between ‘move’ and ‘stop’ less accurate. Consequently, there was more confusion between activities distinguished by ‘move’ and ‘stop’ (e.g., a dump truck’s travel versus idle) than between other activities. Additionally, in view 3, the construction resources were captured from a distance, making them appear, and thus be detected and tracked, smaller than the resources in views 1 and 2. The smaller bounding boxes led to errors when analyzing the interaction between different resources, causing confusion between the excavator’s activity statuses (i.e., load and soil-manage).

4.3 Multi-vision earthwork productivity monitoring

Multi-vision earthwork productivity monitoring was conducted by integrating the three different productivity monitoring results. The results of the proposed multi-vision productivity monitoring are presented in Tab.5. In the case of view 1, out of 29 dump trucks, six were evaluated as outliers (travel: #3, #29; idle: #17, #26; unload: #24, #27). For instance, dump trucks #3 and #29 were analyzed to have taken a longer travel time than the other dump trucks (Fig.11). Next, out of 71 dump trucks, three were evaluated as outliers in view 2 (travel: #56; idle: #8, #52). To confirm the analysis result, the authors manually reviewed the video and observed that dump truck #8 idled because it had to wait for an oncoming dump truck from the opposite direction, which resulted in an extended idle time. Lastly, in the case of view 3, out of 30 dump trucks, six were identified as outliers (travel: #9; idle: #18, #27, #29, #30; load: #13). Based on the authors’ manual review, for example, dump truck #13 moved near the excavator to receive soil, but the excavator had not finished the soil management, causing the dump truck to remain idle for an extended duration. However, because the dump truck was too close to the excavator, it was recognized as being in a ‘load’ state (the excavator in a ‘semi-stop’ state, the dump truck in a ‘stop’ state, and the interaction between them recognized as present) and was therefore analyzed as having a longer load time (Fig.12). The total earthwork volume is presented in Tab.6. Specifically, the framework determined the number of dump trucks that performed loading operations, from which the authors could derive the total earthwork volume.

For multi-vision earthwork productivity monitoring, the productivity monitoring results of the three viewpoints should be integrated to re-identify the IDs of the tracked resources. A total of 88560 images were used for the experiments, with 29520 images from each view, covering the same footage from 08:42 to 11:26 AM under changing sunlight conditions. Tab.7 presents the re-identification results for multi-vision monitoring. Among the four types of resources, only dump trucks were observed to move between different views. Consequently, the analysis focused on the re-identification of dump trucks across four scenarios: moving from zone 1 to 2, 2 to 1, 2 to 3, and 3 to 2. The average accuracy of the re-identification process was 95.26%, with accuracies of 95.83% for zone 1 to 2, 85.19% for zone 2 to 1, and 100% for both zone 2 to 3 and zone 3 to 2. Although occlusion between resources occurred continuously in zone 2, which is a transition zone connecting two different zones with frequent movement of various resources, high re-identification accuracy was still observed between zone 2 and zone 3. The accuracy related to zone 1 was relatively lower, which can be attributed to periodic changes in the location where the dump trucks unload, making it difficult to record all relevant scenes in the video.

To integrate the IDs, the minimum and maximum travel durations from view 1 to 2 were set as 5 and 100 s, and those from view 2 to 3 as 160 and 330 s, by considering the conditions of the target construction site (e.g., distance between the zones). The IDs were successfully integrated, and three dump trucks were confirmed to be successfully and continuously tracked (Tab.8). They were named dump trucks A, B, and C.

Based on the ID integration result, multi-vision earthwork productivity monitoring was conducted (Tab.9). To compare the performance of the three dump trucks, the productivity ratio (productive work time ratio) was calculated. As a result, dump trucks B and C showed the highest productivity ratios, while dump truck A had the lowest. Notably, dump truck A had an idle time almost twice as long as that of the other dump trucks. Manual review confirmed that this outcome was due to dump truck A continuously idling before loading in view 3.

4.4 Site monitoring via the web-based platform

The multi-vision monitoring results were uploaded and displayed in real time on the web-based platform. Using YOLOv5-based real-time monitoring model updating, images with a resolution of 1920 × 1080 were analyzed on average within 0.02 s. To account for any delay during analysis, the authors assumed a maximum analysis time of 0.04 s per image. Based on the detection results, tracking was utilized to assign IDs, and through both individual action and interaction analysis, the activity recognition process was completed within a maximum of 0.01 s per frame. Moreover, the time taken to integrate the vision-based monitoring results for each video was, on average, 0.005 s per frame (Fig.13).

All subsequent processes occurred within the AWS environment. It took approximately 0.1 s to upload the multi-vision monitoring result images to the S3 bucket and about 0.1 s to retrieve them in AWS Lightsail. In summary, the total time required to analyze the image frames and display the real-time monitoring results on the platform was, on average, 0.255 s. This is shorter than the 0.33 s interval between frames of a 3 fps real-time video, verifying that the analysis can be done in real time. Users can check the displayed productivity monitoring results by resource, as illustrated in Fig.14.
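The per-frame latency budget can be summarized with the worst-case figures quoted above:

```python
# Per-frame latency budget for the web platform (seconds, worst case assumed).
detect   = 0.04    # YOLOv5 detection (0.02 s average, doubled as a safety margin)
activity = 0.01    # tracking + individual action + interaction analysis
fuse     = 0.005   # multi-vision integration
upload   = 0.1     # write the result image to the S3 bucket
fetch    = 0.1     # Lightsail front-end pulls the newest image

total = detect + activity + fuse + upload + fetch   # 0.255 s
frame_interval = 1 / 3                              # 3 fps video -> ~0.333 s per frame
assert total < frame_interval                       # real-time display is feasible
```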

In summary, the object detection model achieved a real-time model updating interval of 1114.36 s and a macro F1-score of 87.30%. Based on the object detection results, and despite analyzing the activities of four types of construction resources from three different viewpoints, the framework demonstrated sufficient activity recognition performance with an accuracy of 86.20%. Furthermore, multi-vision earthwork productivity monitoring was successfully carried out through ID re-identification with an accuracy of 95.26%. Lastly, the results of multi-vision productivity monitoring were successfully displayed through the web-based platform in real time. Although this study focused on three site videos and four construction resource types for validation, the framework can be applied directly to other sites by re-building the site-optimized DB and adjusting minor parameters that reflect site conditions (e.g., site layout).

5 Conclusions

In this study, the authors proposed a sustainable framework for multi-vision earthwork productivity monitoring. The proposed framework involves four processes: 1) site-optimized DB development; 2) real-time monitoring model updating; 3) multi-vision productivity monitoring; and 4) web-based monitoring platform for Internet-connected devices. To validate the proposed framework, the authors conducted experiments on an actual earthwork site. Using the site-optimized DB extended the model update interval by a factor of 29.88 and improved the detection macro F1-score by 41.30 percentage points. Activity recognition for four types of resources achieved an accuracy of 86.20%, and successful vision-based productivity monitoring was carried out on three videos, which were then integrated to realize multi-vision earthwork productivity monitoring. Furthermore, these monitoring results were successfully displayed on the web-based platform, enabling multi-vision site monitoring without spatiotemporal constraints.

Given the benefits of the proposed framework, this study makes the following contributions. First, this study proposed a generalized framework for applying vision-based monitoring to any other construction site without relying on particular camera viewpoints or target resources. The proposed framework was able to customize the monitoring model regardless of the site conditions (e.g., characteristics of the image background). Second, it became possible to update the vision-based monitoring models in real time, which can enable the rapid application of computer vision techniques to construction sites. Third, the authors integrated vision-based monitoring results from multiple videos by considering the conditions of the construction site. Fourth, real-time analysis of large and dynamic construction sites became possible, and the results can be monitored on various devices. Lastly, this research can support other research that requires long-term monitoring (e.g., progress monitoring) through real-time monitoring model updating.

Although interesting findings were observed in this study, there are still limitations to be addressed. For instance, the videos were captured outdoors during daylight hours, ensuring adequate brightness for vision-based monitoring; as a result, variations in lighting conditions were not specifically considered. However, it is important to note that most construction activities are conducted during daylight hours for safety and operational efficiency, minimizing the impact of extreme lighting changes. According to Occupational Safety and Health Administration Standard 1926.56(a) [57], earthwork sites must maintain an illumination level of at least three foot-candles to ensure safe working conditions. This reinforces the assumption that lighting conditions in typical construction environments are generally adequate for vision-based monitoring. Nonetheless, certain construction environments may require greater robustness against lighting variations, and future research can explore adaptive techniques to improve system performance under diverse lighting conditions. Additionally, this study specifically focused on earthwork, where large construction equipment plays a dominant role. Earthwork sites typically have relatively structured movement patterns, making object re-identification more feasible and effective. However, generalizing the proposed framework to different site conditions, such as small urban earthwork sites cluttered with larger numbers of workers or earthwork sites in highway work zones with more complex object interactions, is also important. Future research could explore methods to enhance object re-identification under such conditions.

Based on the findings, additional research opportunities are presented. First, with the high performance of the monitoring model maintained over time, the challenge is no longer applying state-of-the-art vision technologies to construction sites in the short term; instead, applying them consistently over the longer term can maximize the potential of computer vision technologies on construction sites. Second, by continuously collecting the location or activity information of each resource, it is possible to develop a digital twin that represents the construction site in a 3D virtual space in real time. Continuously implementing a construction site as a digital twin in real time allows for the simulation of all scenarios in advance. Furthermore, construction productivity can be enhanced for a wider range of heavy operations, and their safety can also be investigated. The authors believe that with more such achievements, real-time multi-vision monitoring on construction sites can lead to future innovative and intelligent construction sites.

References

[1]

Delgado J M D, Oyedele L, Ajayi A, Akanbi L, Akinade O, Bilal M, Owolabi H. Robotics and automated systems in construction: Understanding industry-specific challenges for adoption. Journal of Building Engineering, 2019, 26: 100868

[2]

Pan W, Chen L, Zhan W. PESTEL analysis of construction productivity enhancement strategies: A case study of three economies. Journal of Management Engineering, 2019, 35(1): 05018013

[3]

Ghodrati N, Yiu T W, Wilkinson S, Shahbazpour M. Role of management strategies in improving labor productivity in general construction projects in New Zealand: Managerial perspective. Journal of Management Engineering, 2018, 34(6): 04018035

[4]

Bankvall L, Bygballe L E, Dubois A, Jahre M. Interdependence in supply chains and projects in construction. Supply Chain Management, 2010, 15(5): 385–393

[5]

Barbosa F, Woetzel J, Mischke J, Ribeirinho M J, Sridhar M, Parsons M, Bertram N, Brown S. Reinventing Construction: A Route to Higher Productivity. Brussels: McKinsey & Company, 2017

[6]

Kang S H, Seo J W, Baik K G. 3D-GIS based earthwork planning system for productivity improvement. In: Proceedings of the Construction Research Congress 2009: Building a Sustainable Future. Seattle, WA: ASCE, 2012, 151–160

[7]

Wong E, Swei O. New construction cost indices to improve highway management. Journal of Management Engineering, 2021, 37(4): 04021030

[8]

Kisi K P, Mani N, Rojas E M, Foster E T. Optimal productivity in labor-intensive construction operations: Pilot study. Journal of Construction Engineering and Management, 2017, 143(3): 04016107

[9]

Chen C, Zhu Z, Hammad A. Automated excavators activity recognition and productivity analysis from construction site surveillance videos. Automation in Construction, 2020, 110: 103045

[10]

Wu H, Zhong B, Li H, Guo J, Wang Y. On-site construction quality inspection using blockchain and smart contracts. Journal of Management Engineering, 2021, 37(6): 04021065

[11]

Wang S H. Engineering productivity and unit price assessment model. Journal of Management Engineering, 2022, 38(1): 04021076

[12]

Pradhananga N, Teizer J. Automatic spatio-temporal analysis of construction site equipment operations using GPS data. Automation in Construction, 2013, 29: 107–122

[13]

Lu W, Huang G Q, Li H. Scenarios for applying RFID technology in construction project management. Automation in Construction, 2011, 20(2): 101–106

[14]

Zhang C, Shen W, Ye Z. Technical feasibility analysis on applying ultra-wide band technology in construction progress monitoring. International Journal of Construction Management, 2022, 22(15): 2951–2965

[15]

Gu Y, Ai Q, Xu Z, Yao L, Wang H, Huang X, Yuan Y. Cost-effective image recognition of water leakage in metro tunnels using self-supervised learning. Automation in Construction, 2024, 167: 105678

[16]

Ai Q, Yuan Y, Bi X. Acquiring sectional profile of metro tunnels using charge-coupled device cameras. Structure and Infrastructure Engineering, 2016, 12(9): 1065–1075

[17]

Ai Q, Yuan Y. Rapid acquisition and identification of structural defects of metro tunnel. Sensors, 2019, 19(19): 4278

[18]

Kim J, Hwang J, Jeong I, Chi S, Seo J O, Kim J. Generalized vision-based framework for construction productivity analysis using a standard classification system. Automation in Construction, 2024, 165: 105504

[19]

Kim H, Kim H, Hong Y W, Byun H. Detecting construction equipment using a region-based fully convolutional network and transfer learning. Journal of Computing in Civil Engineering, 2018, 32(2): 04017082

[20]

Ren S, He K, Girshick R, Sun J. Faster R-CNN: Towards real-time object detection with region proposal networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017, 39(6): 1137–1149

[21]

He K, Zhang X, Ren S, Sun J. Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). New York, NY: IEEE, 2016, 770–778

[22]

Li Y, Lu Y, Chen J. A deep learning approach for real-time rebar counting on the construction site based on YOLOv3 detector. Automation in Construction, 2021, 124: 103602

[23]

Shin Y, Choi Y, Won J, Hong T, Koo C. A new benchmark model for the automated detection and classification of a wide range of heavy construction equipment. Journal of Management Engineering, 2024, 40(2): 04023069

[24]

Nath N D, Behzadan A H. Deep convolutional networks for construction object detection under different visual conditions. Frontiers in Built Environment, 2020, 6: 97

[25]

Zhu Z, Ren X, Chen Z. Integrated detection and tracking of workforce and equipment from construction jobsite videos. Automation in Construction, 2017, 81: 161–171

[26]

Xiao B, Zhu Z. Two-dimensional visual tracking in construction scenarios: A comparative study. Journal of Computing in Civil Engineering, 2018, 32(3): 04018006

[27]

Jeong I, Hwang J, Kim J, Chi S, Hwang B G, Kim J. Vision-based productivity monitoring of tower crane operations during curtain wall installation using a database-free approach. Journal of Computing in Civil Engineering, 2023, 37(4): 04023015

[28]

Xiao B, Kang S C. Vision-based method integrating deep learning detection for tracking multiple construction machines. Journal of Computing in Civil Engineering, 2021, 35(2): 04020071

[29]

Angah O, Chen A Y. Tracking multiple construction workers through deep learning and the gradient based method with re-matching based on multi-object tracking accuracy. Automation in Construction, 2020, 119: 103308

[30]

Yan X, Zhang H, Gao H. Mutually coupled detection and tracking of trucks for monitoring construction material arrival delays. Automation in Construction, 2022, 142: 104491

[31]

Luo H, Xiong C, Fang W, Love P E D, Zhang B, Ouyang X. Convolutional neural networks: Computer vision-based workforce activity assessment in construction. Automation in Construction, 2018, 94: 282–289

[32]

Kim J, Chi S. Action recognition of earthmoving excavators based on sequential pattern analysis of visual features and operation cycles. Automation in Construction, 2019, 104: 255–264

[33]

Zhang J, Zi L, Hou Y, Wang M, Jiang W, Deng D. A deep learning-based approach to enable action recognition for construction equipment. Advances in Civil Engineering, 2020, 2020(1): 8812928

[34]

Luo X, Li H, Cai D, Dai F, Seo H O, Lee S H. Recognizing diverse construction activities in site images via relevance networks of construction-related objects detected by convolutional neural networks. Journal of Computing in Civil Engineering, 2018, 32(3): 04018012

[35]

Roberts D, Golparvar-Fard M. End-to-end vision-based detection, tracking and activity analysis of earthmoving equipment filmed at ground level. Automation in Construction, 2019, 105: 102811

[36]

Soltani M M, Zhu Z, Hammad A. Framework for location data fusion and pose estimation of excavators using stereo vision. Journal of Computing in Civil Engineering, 2018, 32(6): 04018045

[37]

Cheng J C P, Wong P K Y, Luo H, Wang M, Leung P H. Vision-based monitoring of site safety compliance based on worker re-identification and personal protective equipment classification. Automation in Construction, 2022, 139: 104312

[38]

Zhang Q, Wang Z, Yang B, Lei K, Zhang B, Liu B. Reidentification-based automated matching for 3D localization of workers in construction sites. Journal of Computing in Civil Engineering, 2021, 35(6): 04021019

[39]

Wei R, Love P E D, Fang W, Luo H, Xu S. Recognizing people’s identity in construction sites with computer vision: A spatial and temporal attention pooling network. Advanced Engineering Informatics, 2019, 42: 100981

[40]

Kim H, Bang S, Jeong H, Ham Y, Kim H. Analyzing context and productivity of tunnel earthmoving processes using imaging and simulation. Automation in Construction, 2018, 92: 188–198

[41]

Kim H, Ham Y, Kim W, Park S, Kim H. Vision-based nonintrusive context documentation for earthmoving productivity simulation. Automation in Construction, 2019, 102: 135–147

[42]

Cheng M Y, Cao M T, Nuralim C K. Computer vision-based deep learning for supervising excavator operations and measuring real-time earthwork productivity. Journal of Supercomputing, 2023, 79(4): 4468–4492

[43]

Bügler M, Borrmann A, Ogunmakin G, Vela P A, Teizer J. Fusion of photogrammetry and video analysis for productivity assessment of earthwork processes. Computer-Aided Civil and Infrastructure Engineering, 2017, 32(2): 107–123

[44]

Chen C, Xiao B, Zhang Y, Zhu Z. Automatic vision-based calculation of excavator earthmoving productivity using zero-shot learning activity recognition. Automation in Construction, 2023, 146: 104702

[45]

Hwang J, Kim J, Chi S, Seo J O. Development of training image database using web crawling for vision-based site monitoring. Automation in Construction, 2022, 135: 104141

[46]

Lee J G, Hwang J, Chi S, Seo J. Synthetic image dataset development for vision-based construction equipment detection. Journal of Computing in Civil Engineering, 2022, 36(5): 04022020

[47]

Hwang J, Kim J, Chi S. Site-optimized training image database development using web-crawled and synthetic images. Automation in Construction, 2023, 151: 104886

[48]

Kim J, Hwang J, Chi S, Seo J O. Towards database-free vision-based monitoring on construction sites: A deep active learning approach. Automation in Construction, 2020, 120: 103376

[49]

Meng D, Yang S, de Jesus A M P, Zhu S P. A novel Kriging-model-assisted reliability-based multidisciplinary design optimization strategy and its application in the offshore wind turbine tower. Renewable Energy, 2023, 203: 407–420

[50]

Meng D, Yang S, de Jesus A M P, Fazeres-Ferradosa T, Zhu S P. A novel hybrid adaptive Kriging and water cycle algorithm for reliability-based design and optimization strategy: Application in offshore wind turbine monopile. Computer Methods in Applied Mechanics and Engineering, 2023, 412: 116083

[51]

Meng D, Yang H, Yang S, Zhang Y, de Jesus A M P, Correia J, Fazeres-Ferradosa T, Macek W, Branco R, Zhu S P. Kriging-assisted hybrid reliability design and optimization of offshore wind turbine support structure based on a portfolio allocation strategy. Ocean Engineering, 2024, 295: 116842

[52]

Yang S, Meng D, Wang H, Yang C. A novel learning function for adaptive surrogate-model-based reliability evaluation. Philosophical Transactions of the Royal Society A: Mathematical, Physical, and Engineering Sciences, 2024, 382(2264): 20220395

[53]

Kim J, Chi S. Adaptive detector and tracker on construction sites using functional integration and online learning. Journal of Computing in Civil Engineering, 2017, 31(5): 04017026

[54]

Redmon J, Divvala S, Girshick R, Farhadi A. You only look once: Unified, real-time object detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). New York, NY: IEEE, 2016, 779–788

[55]

Park C, Chun H, Chi S. Multi-camera people counting using a queue-buffer algorithm for effective search and rescue in building disasters. KSCE Journal of Civil Engineering, 2024, 28(6): 2132–2146

[56]

Fogg I. Benchmarking the Global 5G Experience. Opensignal Limited, June 2023

[57]

Occupational Safety and Health Administration. Safety and Health Regulations for Construction, Standard 1926. Washington, D.C.: OSHA

RIGHTS & PERMISSIONS

The Author(s). This article is published with open access at link.springer.com and journal.hep.com.cn
