A deep neural network combined with a two-stage ensemble model for detecting cracks in concrete structures

Hatice Catal REIS , Veysel TURK , Cagla Melisa KAYA YILDIZ , Muhammet Furkan BOZKURT , Seray Nur KARAGOZ , Mustafa USTUNER

Front. Struct. Civ. Eng., 2025, 19(7): 1091–1109. DOI: 10.1007/s11709-025-1199-y
RESEARCH ARTICLE


Abstract

Detection of cracks in concrete structures is critical for their safety and the sustainability of maintenance processes. Traditional inspection techniques are costly, time-consuming, and inefficient regarding human resources. Deep learning architectures have become more widespread in recent years by accelerating these processes and increasing their efficiency. Deep learning models (DLMs) stand out as an effective solution in crack detection due to their features such as end-to-end learning capability, model adaptation, and automatic learning processes. However, providing an optimal balance between model performance and computational efficiency of DLMs is a vital research topic. In this article, three different methods are proposed for detecting cracks in concrete structures. In the first method, a Separable Convolutional with Attention and Multi-layer Enhanced Fusion Network (SCAMEFNet) deep neural network, which has a deep architecture and can provide a balance between the depth of DLMs and model parameters, has been developed. This model was designed using a convolutional neural network, multi-head attention, and various fusion techniques. The second method proposes a modified vision transformer (ViT) model. A two-stage ensemble learning model, deep feature-based two-stage ensemble model (DFTSEM), is proposed in the third method. In this method, deep features and machine learning methods are used. The proposed approaches are evaluated using the Concrete Cracks Image Data set, which the authors collected and contains concrete cracks on building surfaces. The results show that the SCAMEFNet model achieved an accuracy rate of 98.83%, the ViT model 97.33%, and the DFTSEM model 99.00%. These findings show that the proposed techniques successfully detect surface cracks and deformations and can provide practical solutions to real-world problems. In addition, the developed methods can contribute as a tool for BIM platforms in smart cities for building health.


Keywords

concrete cracks image dataset / crack detection / depthwise separable convolution / multi-scale feature fusion / SCAMEFNet deep neural network / two-stage ensemble learning model

Cite this article

Hatice Catal REIS, Veysel TURK, Cagla Melisa KAYA YILDIZ, Muhammet Furkan BOZKURT, Seray Nur KARAGOZ, Mustafa USTUNER. A deep neural network combined with a two-stage ensemble model for detecting cracks in concrete structures. Front. Struct. Civ. Eng., 2025, 19(7): 1091-1109 DOI:10.1007/s11709-025-1199-y


1 Introduction

Concrete construction has come to the fore with the growing interest in rapid urbanization and high-rise buildings. Civil infrastructure can lose its functional and physical form over time through natural disasters, the natural life cycle of the building, and human activity. With rapidly increasing concrete construction, the end of a building's natural life [1], the structural deformations that accompany it, and deterioration caused by natural disasters can all pose vital risks. It is critical for occupants that a building serve safely and soundly without creating a security threat. Therefore, health and condition evaluations of structures are essential. Concrete structures are subject to deformation, sometimes due to anthropogenic or natural disasters [2,3] and unpredictable stresses, and sometimes due to their natural cycle. Cracks are among the most crucial problems in concrete structures [3]. Early detection of minor damage is essential to prevent the growth of permanent cracks, reduce costs, and prevent potential risks [4]. Cracks are sometimes too small to be noticed, which makes analysis and evaluation complex. Structural health assessment is usually done manually or based on expert opinion. These methods are often difficult and time-consuming, and in some cases they can put human life at risk [5,6]. Recent technological advancements and enhanced hardware have led to substantial progress in detecting, classifying, and segmenting structural deformations. In particular, critical progress has been made in crack detection with advances in image processing and artificial intelligence techniques [2]. In this context, the use of machine learning, one of the artificial intelligence techniques, for detecting and analyzing cracks in concrete structures has increased significantly in recent years [7].

In particular, deep learning architectures based on multi-layer artificial neural networks offer powerful and effective solutions for detecting concrete cracks. Thanks to the end-to-end learning paradigm, these algorithms can detect cracks directly in raw image data; data preprocessing, feature extraction, and classification are performed automatically. Previous studies have frequently utilized standard convolutional layers in deep learning architectures based on Convolutional Neural Networks (CNNs) to classify concrete cracks. Russel and Selvaraj [8] introduced MultiScaleCrackNet, a deep hierarchical CNN architecture derived from a standard CNN, to classify cracks on concrete surfaces; the study noted that the computational cost of the model was high. Nyathi et al. [9] developed a customized CNN-based deep learning model using standard convolutional layers to classify concrete cracks. Shashidhar et al. [10] proposed the CrackSpot model for the automatic detection of cracks in concrete structures, also developed with standard convolutional layers. Other studies show that machine learning algorithms (MLAs) maintain their popularity in crack detection. Bhalaji Kharthik et al. [11] performed feature extraction using a transfer learning-based deep CNN and fed the obtained deep features to a support vector machine for classification. Rashid et al. [12] proposed several methods for detecting concrete surface cracks, including artificial neural networks, support vector machines (SVM), k-nearest neighbors (k-NN), and decision trees (DT).

This paper proposes three innovative methods for detecting cracks in concrete structures. The first method introduces an advanced deep learning model integrating CNN with Multi-Head Attention (MHA) mechanisms, called the Separable Convolutional with Attention and Multi-layer Enhanced Fusion Network (SCAMEFNet). SCAMEFNet is optimized with two techniques called late fusion and multi-layer feature fusion. In the SCAMEFNet model, CNNs learn local features effectively but have difficulty learning global features due to their limited receptive fields; in contrast, the MHA mechanism can learn global features, general structures, and long-range dependencies. The model is designed primarily around Depthwise Separable Convolution (DSC) layers, although standard convolution layers are also among its architectural components. DSC is a convolution variant with advantages in power consumption and computational cost; however, it carries a potential risk of information loss due to its limited channel interaction. Standard convolution layers are integrated into the architecture to overcome this limitation and optimize performance. The SCAMEFNet framework is constructed using multi-layer feature fusion along with a late fusion strategy: in the late fusion technique, the features obtained in the network are combined in the last layer, while in the multi-layer feature fusion technique, features taken from different layers are combined. These mechanisms regulate the flow of information within the model, reduce gradient loss, and enhance the efficiency of the deep neural network. The second method proposes a modified Vision Transformer (ViT) model developed with 16 transformer encoder layers. The third method develops a two-stage ensemble learning model, the deep feature-based two-stage ensemble model (DFTSEM). In the first stage of DFTSEM, deep features are obtained from the SCAMEFNet and ViT architectures and classified with Adaptive Boosting (AdaBoost), Random Forest (RF), Logistic Regression (LR), and Hist Gradient Boosting (HGB) machine learning models. The final classification is performed in the second stage using a Hard Voting Classifier, an ensemble learning technique created by combining the AdaBoost, RF, LR, and HGB algorithms.

The proposed methods provide various solutions and improvements to the limitations of existing studies in the literature. Using DSC alongside standard convolution layers in the SCAMEFNet model significantly reduces the number of parameters and the computational complexity. This approach provides an effective solution to the high computational cost problem noted by Russel and Selvaraj [8]. Moreover, incorporating MHA into the SCAMEFNet framework enhances the model's ability to capture local and global features, effectively learn long-range dependencies, and expand its capability for multi-scale feature extraction and contextual data processing. This approach minimizes the limitations of the standard CNN architectures of Nyathi et al. [9] and Shashidhar et al. [10] in extracting wide contextual information and capturing features at different scales, thus providing more effective results. The modified ViT model addresses the limitations of CNN-based approaches in capturing scale variability and spatial relationships. Finally, the ensemble learning technique in the DFTSEM model overcomes the limitations of a single MLA and provides a solution with higher generalization capacity. This approach combines the strengths of different models to address the limitations of the single-stage hybrid methods and MLAs of Bhalaji Kharthik et al. [11] and Rashid et al. [12], potentially delivering higher classification performance. Machine learning techniques offer a new perspective on mechanical problem-solving and the analysis of complex structures in engineering [13,14].

This study significantly contributes to crack detection in concrete structures by introducing innovative methodologies and resources. It features the creation of the Concrete Cracks Image Data set (CCID), a comprehensive collection of crack images from concrete surfaces with varying textures. This data set is a significant advancement, as it allows deep learning architectures to be evaluated thoroughly under real-world conditions. Additionally, the development of SCAMEFNet, a state-of-the-art DL architecture that combines CNN with MHA techniques, demonstrates new approaches that enhance the accuracy and effectiveness of crack detection. The DFTSEM ensemble learning method also offers a fresh perspective for addressing existing performance gaps in this field. This study lays the groundwork for future advancements by tackling the limitations of current techniques and exploring the advantages of modern deep-learning architectures. It will also significantly contribute to infrastructure assessment and safety. Moreover, with the ongoing transformation of cities into smart cities, easy and fast detection is essential for rapid interventions, and the study serves as a tool for building health and management in these smart-city transformations.

This study comprehensively compares current CNN-based technologies to analyze the proposed methods' performance in crack detection in concrete structures. Within this scope, the performance of the techniques in crack detection was assessed, with a comprehensive analysis of their strengths and limitations. To measure the effectiveness of the deep learning models (DLMs), a new image data set called CCID, created by the authors and containing crack images obtained from the concrete surfaces of public buildings, was used. Because it contains cracks on different surface textures, the CCID data set provides the opportunity to evaluate the performance and generalizability of deep learning architectures more comprehensively under real-world conditions. In this direction, a new solution proposal was developed to address the performance gaps of DL models in real-world scenarios and significantly contribute to this field. The principal findings of this study can be summarized as follows.

1) Creation of an image data set consisting of concrete cracks and making it publicly available.

2) Development of SCAMEFNet, an innovative deep learning model developed with two different fusion strategies based on the integration of CNN and MHA techniques.

3) Evaluation of the performance of the modified transformer-based ViT model in crack detection.

4) Evaluation of the DFTSEM method, a two-stage ensemble learning model proposed for crack detection.

5) The performance of the SCAMEFNet and ViT models in crack detection was evaluated in detail against modern CNN-based deep learning architectures. In this process, the advantages and limitations of the deep learning architectures were determined, and an ablation study was performed.

This paper is organized as follows. Section 2 presents the proposed methodologies for detecting cracks in concrete structures. Section 3 gives detailed information about the data set used in the experimental process, the implementation details, the evaluation metrics, the methods proposed for crack detection, and the results of comprehensive experiments performed with modern DL models. Section 4 presents the limitations of this article and its conclusions, along with future studies.

2 Proposed approaches

In this part, technical details of the proposed methods are presented: Subsection 2.1 describes the SCAMEFNet deep learning architecture, Subsection 2.2 describes the modified ViT model, and Subsection 2.3 presents the details of the DFTSEM method.

2.1 SCAMEFNet architecture

Increasing the network depth of DLMs is one of the standard techniques for improving performance in classification/segmentation tasks. Deeper neural networks can learn more complex information, which generally affects the model's prediction success positively. However, deepening a network can increase overfitting, complicate model training, and cause gradient loss. Gradients play a vital role in training deep learning architectures through backpropagation. The gradient measures the variation of the loss function (the difference between the actual and predicted values) with respect to the architecture's parameters, and this information is used to update the parameters of the deep neural network. In the backpropagation process, starting from the output layer, the gradient is calculated for each layer by moving backward through the network. This approach can enhance training and improve prediction results. The vanishing gradient problem occurs when the backward-propagating gradient becomes increasingly small in the deeper layers of the architecture, approaching zero. This can hinder training by preventing the weights in earlier layers from being effectively updated. To mitigate this issue, techniques such as activation functions like the Rectified Linear Unit (ReLU) and Scaled Exponential Linear Unit (SeLU), skip connections (as in the ResNet model), or Batch Normalization (BN)/Layer Normalization (LN) can be effective. Overfitting is another significant issue that can adversely affect the performance of deep neural networks. Overfitting refers to the architecture fitting the training data too closely, reducing its ability to generalize and make accurate predictions on new, unseen data. It usually occurs due to a limited amount of training data, excessive complexity of the neural network architecture, or growing model complexity during training. To address overfitting, techniques such as data augmentation, dropout, regularization, and early stopping can be employed. These limitations of DLMs necessitate the investigation of innovative approaches and advanced algorithmic solutions for optimizing neural network structures.

The main motivation for developing the proposed SCAMEFNet deep learning model in this study is to bring an innovative and alternative perspective to the vanishing gradient problem, overfitting, and the complexity that grows as the neural network deepens, all of which are frequently encountered in DLMs. In addition, the proposed model aims to provide an effective strategy for detecting cracks in concrete structures. The model was developed using the advantages of CNNs (standard convolutional and DSC layers) and MHA techniques, together with the multi-layer feature fusion and late fusion techniques proposed in this study. The innovative architectural design and advanced module integration of SCAMEFNet provide an optimal balance between network depth and parameter count, making significant progress in computational efficiency and model performance. The CNN architecture forms the basis of the SCAMEFNet model. CNNs hierarchically process local spatial information in images with filters in convolutional layers and automatically extract complex patterns. However, traditional CNNs may struggle to model long-range spatial dependencies and capture global features. The depth of the model can be increased to overcome these limitations, and techniques such as attention mechanisms or dilated convolutions can be used; however, these approaches generally increase model complexity and computational cost. Especially on devices with limited hardware, problems with energy efficiency and integration may arise. It is therefore critical to focus on lightweight and efficient model architectures. Here, the DSC layer can be used to reduce the computational cost and processing power of CNN-based models built with standard convolutional layers; this technology offers a practical solution for resource-constrained mobile and embedded platforms. As an alternative approach, the MHA mechanism, the basic component of the ViT architecture, can be integrated into the model. However, DSC risks information loss, especially on complex data sets, due to its limited channel interactions compared to standard convolutional layers, which can negatively impact model accuracy. The proposed SCAMEFNet model addresses this issue by using standard convolutional layers alongside DSC. This hybrid approach aims to optimize model performance by minimizing information loss while preserving DSC's advantages of low computational cost and a small number of parameters. The MHA mechanism is also integrated into the SCAMEFNet model to minimize information loss. MHA increases the generalization capacity of the architecture thanks to its ability to model complex relationships using parallel attention heads; however, adding MHA may increase computational cost. The MHA structure is therefore incorporated into the model using a skip connection approach. In this hybrid design, the input data are first processed by CNN blocks, and the MHA mechanism then processes the resulting feature maps. In the proposed CNN-MHA combination, CNNs learn local spatial features through the convolutional layers, while MHA learns general structures among features by modeling long-range dependencies. This integrated approach improves model performance by optimizing the fusion of local and global information, while maintaining computational efficiency and regulating gradient flow with the skip connection structure. The effect of the CNN-MHA combination on model performance is analyzed in the experimental process, and the findings show that it significantly improves classification performance by increasing the architecture's learning capacity.

SCAMEFNet was developed using two different fusion strategies: late fusion and multi-layer feature fusion. In the late fusion technique, features from the model's final addition (add) layer and from the Cascaded Skip Connection Modules (CSCM) are combined in a Concatenate (concat) layer at the end of the network. This approach increases the model's stability by integrating different features while also reducing processing power. The multi-layer feature fusion approach integrates feature maps from various layers within the model, enabling it to leverage both specific and broad information simultaneously. Thus, the model can analyze the small details, complex relationships, and large-scale structures of an image more comprehensively. Consequently, the two proposed fusion techniques enhance the deep learning model's performance, enabling it to extract more effective features. The flow diagram illustrating the fusion strategies is presented in Fig.1.
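To make the two fusion strategies concrete, the following minimal sketch illustrates the general pattern in TensorFlow/Keras. It is an illustration only, not the actual SCAMEFNet graph: the layer counts, filter widths, and pooling choices are assumptions made for demonstration.

```python
import tensorflow as tf
from tensorflow.keras import layers

inputs = tf.keras.Input(shape=(224, 224, 3))

# Three feature stages at decreasing spatial resolution.
f1 = layers.SeparableConv2D(32, 3, strides=2, padding="same", activation="relu")(inputs)
f2 = layers.SeparableConv2D(64, 3, strides=2, padding="same", activation="relu")(f1)
f3 = layers.SeparableConv2D(64, 3, strides=2, padding="same", activation="relu")(f2)

# Multi-layer feature fusion: bring features from different depths
# to a common shape, then concatenate them.
p1 = layers.GlobalAveragePooling2D()(f1)
p2 = layers.GlobalAveragePooling2D()(f2)
p3 = layers.GlobalAveragePooling2D()(f3)
multi_layer = layers.Concatenate()([p1, p2, p3])

# Late fusion: an element-wise add of two same-shape branches,
# concatenated with the multi-layer features in the last layer.
branch_a = layers.Dense(64, activation="relu")(p3)
branch_b = layers.Dense(64, activation="relu")(p2)
added = layers.Add()([branch_a, branch_b])
fused = layers.Concatenate()([added, multi_layer])

outputs = layers.Dense(2, activation="softmax")(fused)
model = tf.keras.Model(inputs, outputs)
```

The add layer merges same-shape branches element-wise, while the concat layer keeps all incoming features side by side; this mirrors the add/concat distinction used in the SCAMEFNet fusion stages.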

2.1.1 Background

In this section, theoretical analyses of the main components (standard convolutional layer, DSC layer and MHA) in the SCAMEFNet model are presented.

1) Convolutional neural network and depthwise separable convolution

A Convolutional Neural Network is an artificial neural network specifically built to identify and understand features within images. It is commonly utilized in classification and segmentation tasks across various fields such as image processing, recommendation systems, natural language processing, sound processing, and biomedicine. It consists of convolutional layers (CL), a pooling layer, and a fully connected layer (FCC). The CL is a deep learning layer for feature extraction; the pooling layer reduces the size of the feature maps, and the FCC performs the classification function of the architecture. The CL uses different filter matrices to discover features in the input data: the filter matrix moves to the right and down, starting from the upper left corner of the input, and this process is repeated on all channels of the input data, generating a feature map for each channel. During model training, the backpropagation algorithm reduces the error on the data set and the model is optimized. DSC is a convolution variant used in CNN architectures. It is more efficient than traditional convolution layers because it is lighter, requires fewer parameters, and needs less computational power. DSC consists of two key steps. First, depthwise convolution applies a separate filter to each input channel, efficiently extracting features while reducing computational cost. Second, pointwise convolution uses 1 × 1 filters to model inter-channel relationships and generate the output. This two-step process significantly reduces the number of parameters and computations compared to standard convolution while maintaining model performance. DSC is particularly effective in resource-constrained environments, such as mobile and embedded systems, where it balances computational efficiency with feature extraction capability. The standard convolution formula and computational cost, the DSC formula and computational cost, and a comparative mathematical analysis of the two are given in Eqs. (1)–(5) [15]. Here, a standard convolution kernel $K$ of size $(D_K, D_K, M, N)$ is applied to an input feature map $F$ of size $(D_F, D_F, M)$, where $D_K$ is the spatial kernel size, $D_F$ the spatial size of the feature map, $M$ the number of input channels, and $N$ the number of output channels.

The formula for standard convolution and its computational cost can be expressed as follows:

$$G_{k,l,n} = \sum_{i,j,m} K_{i,j,m,n} \times F_{k+i-1,\, l+j-1,\, m}, \tag{1}$$

$$D_K \times D_K \times M \times N \times D_F \times D_F. \tag{2}$$

The formula for DSC and its computational cost can be expressed as follows:

$$\hat{G}_{k,l,m} = \sum_{i,j} \hat{K}_{i,j,m} \times F_{k+i-1,\, l+j-1,\, m}, \tag{3}$$

$$D_K \times D_K \times M \times D_F \times D_F + M \times N \times D_F \times D_F. \tag{4}$$

The computational complexity (ratio of the two costs) of the standard convolution and DSC operations is given below comparatively:

$$R = \frac{D_K \times D_K \times M \times D_F \times D_F + M \times N \times D_F \times D_F}{D_K \times D_K \times M \times N \times D_F \times D_F} = \frac{1}{N} + \frac{1}{D_K^2}, \tag{5}$$

where R is always less than one. This shows that a model developed using the DSC method is more efficient and less costly than models developed with standard convolution, with the savings growing as the number of output filters N increases [15]. Compared to the standard convolution process, DSC contains fewer parameters, has lower computational cost, and can be considered a helpful method for researchers in tasks such as classification and segmentation.
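The savings predicted by Eq. (5) can be checked empirically. The sketch below, assuming TensorFlow/Keras and the illustrative sizes $D_F = 56$, $M = 64$, $N = 128$, $D_K = 3$, compares the parameter counts of a standard Conv2D layer and a SeparableConv2D layer.

```python
import tensorflow as tf
from tensorflow.keras import layers

x = tf.keras.Input(shape=(56, 56, 64))             # D_F = 56, M = 64

std = layers.Conv2D(128, 3, padding="same")(x)      # N = 128, D_K = 3
dsc = layers.SeparableConv2D(128, 3, padding="same")(x)

std_params = tf.keras.Model(x, std).count_params()  # 3*3*64*128 + 128 = 73,856
dsc_params = tf.keras.Model(x, dsc).count_params()  # 3*3*64 + 64*128 + 128 = 8,896

# Cost ratio predicted by Eq. (5): 1/N + 1/D_K^2
ratio = 1 / 128 + 1 / 3 ** 2
print(std_params, dsc_params, round(ratio, 3))      # ratio ~ 0.119
```

Ignoring biases, the DSC layer uses roughly 12% of the standard layer's parameters, in agreement with $1/N + 1/D_K^2 \approx 0.119$.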

2) Multi-head attention

Multi-head attention [16] is the core block in transformer models. The MHA mechanism is used in the ViT model to model distant relationships within the image more effectively and to generate comprehensive feature representations. Multi-head attention consists of more than one attention head, each operating on Query, Key, and Value projections. It is applied so that each head learns different features of the image patches; the heads capture different aspects of how the data are distributed. The operations are performed in parallel, which allows the model to attend to features in different patches simultaneously. Thus, the model can capture more complex relationships and discover valuable tacit knowledge. Each head computes attention weights in a different representation subspace, effectively assigning higher weights to more salient information and lower weights to less relevant details. This weighted approach enhances the efficiency of feature learning [17].
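As a brief illustration, the sketch below applies Keras's built-in MultiHeadAttention layer to a token sequence in a self-attention configuration; the sequence length, channel width, and head count are illustrative assumptions.

```python
import tensorflow as tf

# A feature map flattened into a sequence of 49 tokens with 64 channels.
tokens = tf.random.normal((1, 49, 64))

# Four parallel heads; each attends in its own 16-dimensional subspace.
mha = tf.keras.layers.MultiHeadAttention(num_heads=4, key_dim=16)

# Self-attention: Query, Key, and Value all come from the same tokens.
out, scores = mha(tokens, tokens, return_attention_scores=True)
print(out.shape)     # (1, 49, 64)
print(scores.shape)  # (1, 4, 49, 49): one 49 x 49 attention map per head
```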

2.1.2 Network structures

SCAMEFNet is a deep learning model with a very deep and complex structure. The essential components of this model are as follows.

Input and stem block: the model receives 224 × 224, three-channel image inputs. The stem module is designed to extract feature maps from the input and transmit this information to more advanced learning layers.

SCAMEF blocks: the main processing units of the model.

CSCM: provides multi-scale feature extraction.

Transition layers: added to the architecture to increase the model's ability to extract more complex and abstract features.

Late fusion: at this stage, features from different layers of the model (add layer, CSCM) are combined.

GAP layer: flattens the feature maps and produces a vector output.

Output layer: classification is performed with the Softmax function.

The network structure of the SCAMEFNet deep learning model is shown in Fig.2.

1) Structural analysis of the stem block

The stem block is the first structural component of the SCAMEFNet architecture and was developed using the basic principles of the feedforward neural network. This block extracts feature maps from the input and then passes these features to other layers for deeper, higher-level feature extraction. The structural components of the block include Convolution2D (Conv2D), BN, 2-dimensional (2D) max pooling, and the ReLU activation function. In this module, the Conv2D layer uses 64 filters, a 3 × 3 kernel size, a 3 × 3 stride, "same" padding, and an "l2" kernel regularizer. Fig.3 provides a detailed view of the stem layer.
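A minimal sketch of the stem block, following the components listed above, is given below; the pooling size is an assumption, since it is not specified in the text.

```python
import tensorflow as tf
from tensorflow.keras import layers, regularizers

def stem_block(inputs):
    # Conv2D with 64 filters, 3 x 3 kernel, 3 x 3 stride, "same" padding,
    # and an l2 kernel regularizer, followed by BN, ReLU, and max pooling.
    x = layers.Conv2D(64, kernel_size=3, strides=3, padding="same",
                      kernel_regularizer=regularizers.l2())(inputs)
    x = layers.BatchNormalization()(x)
    x = layers.ReLU()(x)
    return layers.MaxPooling2D(pool_size=2)(x)  # pool size assumed

inputs = tf.keras.Input(shape=(224, 224, 3))
print(stem_block(inputs).shape)  # (None, 37, 37, 64)
```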

2) Structural and functional analysis of the SCAMEF block

The SCAMEF block is the main framework component of the SCAMEFNet model. This block represents an advanced variation of the CNN and consists of two submodules: the Convolutional Block and the Separable Convolutional Block. The block was developed by combining standard, separable, and depthwise convolution types. This structure aims to increase the network's learning capacity by combining different convolution types and computational strategies while optimizing the number of parameters. Separable convolutions remarkably reduce the computational cost of the architecture compared to standard convolutions. This is achieved by performing the convolution process in two stages: depthwise convolution and pointwise convolution. Depthwise convolutions filter each channel separately, which reduces the connections between channels. This structure reduces the number of parameters, the computational cost, and the risk of overfitting. The separable and depthwise approaches increase the computational efficiency of the architecture while also improving its performance. The SCAMEF block is enhanced with an advanced skip connection structure, which regulates the gradient flow and increases the training stability of deep networks. The SCAMEFNet model has multiple SCAMEF blocks, which allows the extraction of multi-level and complex features from the input data. Each block processes the output of the previous block(s) to obtain more abstract and high-level features, increasing the capacity of the deep learning model to analyze complex data structures. As a result, the SCAMEF block performs critical functions of the SCAMEFNet model, such as multi-level feature extraction, increased model performance, and computational efficiency. The structural components of the block include: a Conv2D layer with a 3 × 3 kernel size and various filter values (e.g., 24, 18, 8), a DepthwiseConv2D layer with a 1 × 1 kernel size and 32, 28, 16, and 8 filter values, a SeparableConv2D layer with 1 × 1 and 3 × 3 kernel sizes and various filter values (e.g., 64, 96, 128), an add layer, a concat layer, BN, the ReLU activation function, convolutional blocks, and separable convolutional blocks. The network structure of the SCAMEF block is shown in Fig.4(a). The Convolutional and Separable Convolutional blocks form two SCAMEF submodules with similar structural features. Both blocks are developed using the basic principles of feedforward neural networks and the skip connection structure. These blocks increase the ability of the deep learning model to extract higher-level, abstract, and complex features. Their structural components include Conv2D/SeparableConv2D, the ReLU activation function, BN, and an add layer. Both blocks use different filter numbers (64, 32, 24, 24) and kernel sizes (1 × 1, 3 × 3). The main difference between the two blocks is that standard convolution is applied in the Convolutional block, while the more efficient separable convolution is used in the Separable Convolutional block. This increases the efficiency of the Separable Convolutional block by reducing the computational cost and the model parameters. The network architectures of the two blocks are given in Fig.4(b) and Fig.4(c), respectively.
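The following sketch shows the general pattern of a separable convolutional block with a skip connection. It illustrates the design principle rather than the published block: the exact layer ordering and filter values inside SCAMEFNet may differ.

```python
import tensorflow as tf
from tensorflow.keras import layers

def separable_conv_block(x, filters):
    # Residual block built from separable convolutions with BN/ReLU.
    shortcut = x
    y = layers.SeparableConv2D(filters, 3, padding="same")(x)
    y = layers.BatchNormalization()(y)
    y = layers.ReLU()(y)
    y = layers.SeparableConv2D(filters, 1, padding="same")(y)
    y = layers.BatchNormalization()(y)
    # Skip connection: match channel counts with a 1 x 1 projection if needed.
    if shortcut.shape[-1] != filters:
        shortcut = layers.Conv2D(filters, 1, padding="same")(shortcut)
    y = layers.Add()([y, shortcut])
    return layers.ReLU()(y)

x = tf.keras.Input(shape=(56, 56, 32))
print(separable_conv_block(x, 64).shape)  # (None, 56, 56, 64)
```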

3) Structural analysis of the transition layer

The transition layer is developed using a feedforward neural network and a skip connection structure. This layer processes high-level features obtained from the SCAMEF blocks and creates more abstract representations of complex patterns, allowing the SCAMEFNet model to solve complex problems effectively. The structural components of the transition layer include Conv2D/SeparableConv2D/DepthwiseConv2D, the ReLU activation function, BN, 2D upsampling, 2D max pooling, and an add layer. The network structure of the transition layer is shown in Fig.5.

4) Cascaded skip connection modules

The CSCM block was developed with an innovative integration of CNN and MHA methods. In this block, the model’s input data are created using the feature matrices obtained from the last add layer and the first three transition layers in the SCAMEFNet architecture. The input data are processed primarily through convolutional layers (single or multiple SeparableConv2D layers) that have a feedforward and hierarchical structure. These layers extract local features in the image. The feature matrices obtained by convolutional operations are processed with BN and ReLU activation functions. BN normalizes the feature matrices and organizes the data distribution. The ReLU enables learning nonlinear features. After this process, the three-dimensional feature matrices obtained from the convolutional layers are converted to a two-dimensional format to operate the MHA mechanism. The MHA calculates the relationship of each feature with other features by applying an attention mechanism on the feature matrix. Thanks to this process, long-distance relationships and dependencies between features at distant locations can be detected. After the MHA process, the feature matrix is converted back to a three-dimensional format. In the last stage, the output of the CSCM block is obtained by applying LN. Each component in the CSCM block combines different feature extraction techniques to increase the SCAMEFNet deep learning model’s ability to learn complex data structures and significantly contribute to its performance. The CSCM network structure is shown in Fig.6.
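A minimal sketch of the CSCM pattern described above (separable convolution, BN/ReLU, a 3D-to-2D reshape for MHA, and a reshape back followed by LN) is given below; the head count and filter width are illustrative assumptions.

```python
import tensorflow as tf
from tensorflow.keras import layers

def cscm(feature_map, num_heads=4):
    # Local feature extraction with a separable convolution.
    x = layers.SeparableConv2D(64, 3, padding="same")(feature_map)
    x = layers.BatchNormalization()(x)
    x = layers.ReLU()(x)

    # Flatten the H x W grid into a token sequence for the MHA layer.
    h, w, c = x.shape[1], x.shape[2], x.shape[3]
    seq = layers.Reshape((h * w, c))(x)
    att = layers.MultiHeadAttention(num_heads=num_heads,
                                    key_dim=c // num_heads)(seq, seq)

    # Restore the spatial layout and normalize.
    out = layers.Reshape((h, w, c))(att)
    return layers.LayerNormalization()(out)

x = tf.keras.Input(shape=(14, 14, 32))
print(cscm(x).shape)  # (None, 14, 14, 64)
```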

2.2 Modified vision transformer

In this study, a modified ViT model with 16 Transformer encoder layers was developed as the second method proposed for detecting cracks in concrete structures. The ViT is a deep learning-based architecture proposed by Dosovitskiy et al. [18]. This model has achieved successful results in image classification tasks and offers significant advantages over traditional CNN-based architectures [19]. Both architectures are capable of processing large-scale image data. These two structures differ in terms of architecture. CNN models generally extract features from input data using convolution layers, pooling layers, activation functions, and normalization layers. In this process, feature maps are created using filters, and these maps are optimized using the backpropagation algorithm during training. CNNs are pretty effective in capturing local features but may be limited in learning long-range dependencies and capturing global features. The ViT model adopts an attention mechanism-based approach to process image data. The architecture divides the input image into small pieces, and each piece is passed through Transformer encoder blocks. These encoder blocks are developed based on the residual network principle and include components such as MHA, LN, and Multilayer Perceptron (MLP). MHA learns the relationships of each image piece with other pieces and thus effectively captures long-distance dependencies. Normalization layers and activation functions stabilize the model’s learning process, prevent gradient vanishing, and increase the neural network’s ability to capture complex patterns. The MLP layer is usually structured according to the feedforward neural network methodology. The architecture uses the MLP head layer after the Transformer encoder step. This layer ensures that the classification process is performed in the output layer. A modified ViT model was used in the experimental process.

In this model, images are divided into fixed-size patches in the first step. In the second step, these image patches are processed by Transformer encoder blocks, and features are learned from each patch. In the last step, classification is performed with a Softmax layer following an MLP block added to the architecture. The ViT architecture consists of normalization, resizing, patch extraction, a patch encoder, transformer encoder blocks (MHA, LN, MLP (Dense, BN, ReLU activation function, dropout), add layer), flatten, dropout, an MLP (Dense, BN, ReLU activation function, dropout), and a Softmax output layer. The network structure of the modified ViT architecture is presented in Fig.7.
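For illustration, the sketch below builds a simplified version of this pipeline in Keras: a strided convolution acts as the patch encoder, and 16 encoder blocks are stacked as described above. The token dimension and MLP width are assumptions, and the encoder MLP is simplified relative to the Dense, BN, ReLU, dropout stack of the actual model.

```python
import tensorflow as tf
from tensorflow.keras import layers

def transformer_encoder(tokens, num_heads=4, mlp_dim=128):
    # Pre-norm encoder block: MHA and an MLP, each with a residual add.
    dim = tokens.shape[-1]
    x = layers.LayerNormalization()(tokens)
    x = layers.MultiHeadAttention(num_heads=num_heads,
                                  key_dim=dim // num_heads)(x, x)
    x = layers.Add()([tokens, x])
    y = layers.LayerNormalization()(x)
    y = layers.Dense(mlp_dim, activation="relu")(y)
    y = layers.Dropout(0.1)(y)
    y = layers.Dense(dim)(y)
    return layers.Add()([x, y])

inputs = tf.keras.Input(shape=(224, 224, 3))
# Patchify with a strided convolution: 16 x 16 patches -> 196 tokens of dim 64.
patches = layers.Conv2D(64, 16, strides=16)(inputs)
tokens = layers.Reshape((14 * 14, 64))(patches)
for _ in range(16):  # 16 transformer encoder layers, as in the modified ViT
    tokens = transformer_encoder(tokens)
outputs = layers.Dense(2, activation="softmax")(layers.Flatten()(tokens))
model = tf.keras.Model(inputs, outputs)
```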

2.3 Deep feature-based two-stage ensemble model

In this study, the third method proposed for detecting cracks on concrete surfaces was developed as a two-stage ensemble learning model that uses deep features, the DFTSEM model. In this approach, deep learning, machine learning, and ensemble learning techniques were combined. The application was developed using the Scikit-learn 1.4.2 and Python 3.12.4 release packages. An ensemble learning model is a machine learning method designed to achieve high performance by effectively integrating multiple similar or diverse classifiers [20]. Many studies have shown that ensemble learning is widely used across different problems and performs better than a single classifier in the prediction process [20]. In this study, the voting classifier ensemble learning model was preferred. A voting classifier is a method that combines different classifiers and makes predictions according to the voting principle. Voting classifiers come in two types: "soft" and "hard." In the soft technique, the prediction probabilities for each class are summed, and the final prediction is the class with the highest total probability. In the hard technique, the classifiers' predicted labels are counted, and the class label with the most votes is chosen as the final prediction.

In this study, the hard voting classifier was used. In the process of the DFTSEM model, feature vectors obtained from SCAMEFNet and ViT models were used. The study was carried out using similar methodologies for the two methods. In the first stage of the DFTSEM model, feature extraction was performed through DLMs. The obtained features were classified using AdaBoost, RF, LR, and HGB classifiers. The optimal hyper-parameters of the MLAs used in this stage are given in Tab.1. In the second stage, a Voting Classifier was created using AdaBoost, RF, LR, and HGB classifiers, and the final classification process was performed through this ensemble model. The structure of the DFTSEM model is given in Fig.8.
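A minimal sketch of the two-stage pipeline with scikit-learn is shown below. The base classifiers use default hyper-parameters rather than the tuned values in Tab.1, and X_train/y_train stand in for the deep feature vectors and their labels.

```python
from sklearn.ensemble import (AdaBoostClassifier, HistGradientBoostingClassifier,
                              RandomForestClassifier, VotingClassifier)
from sklearn.linear_model import LogisticRegression

# Stage 1: each classifier is fit on deep features extracted by
# SCAMEFNet or ViT (X_train: n_samples x n_features, y_train: 0/1 labels).
estimators = [
    ("ada", AdaBoostClassifier()),
    ("rf", RandomForestClassifier()),
    ("lr", LogisticRegression(max_iter=1000)),
    ("hgb", HistGradientBoostingClassifier()),
]

# Stage 2: hard voting - the label predicted by the majority of the
# four base classifiers becomes the final prediction.
ensemble = VotingClassifier(estimators=estimators, voting="hard")
# ensemble.fit(X_train, y_train)
# y_pred = ensemble.predict(X_test)
```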

3 Experiments and experimental results

This section gives a comprehensive analysis of the data set of concrete cracks collected by the authors. Data preprocessing techniques are discussed in detail, and detailed information on deep learning applications is provided. Also, the metrics used to evaluate the classification performance of deep learning architectures are explained. At the end of the section, experimental results are presented comprehensively.

3.1 Data set and preprocessing

This study presents the new CCID [21] collected by the authors. Data were collected at Gümüşhane University, Faculty of Engineering and Natural Sciences, Türkiye, using Samsung Galaxy M31 and Samsung Galaxy A50 Android smartphones. A total of 2000 images were recorded. The data set was divided into two classes: "No crack" and "Crack". The resolutions of the images vary between 1860 × 4032 and 1504 × 3264 pixels. All images were saved in JPG format to ensure data integrity and standardize processing. In the experimental process, 1400 images were used for training and 600 images as test data; other details are shown in Tab.2. Example images from the CCID data set are given in Fig.9. In the preprocessing step, all data were resized to 224 × 224 pixels, the input layer size of the DLMs, using cubic interpolation (inter-cubic). In addition, the pixel values were scaled to the range 0–1, and labels were then assigned: the "0" label is defined for no-crack images, and the "1" label for crack images.
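A minimal preprocessing sketch along these lines, assuming OpenCV for reading and cubic-interpolation resizing (the file name is a placeholder), is given below.

```python
import cv2
import numpy as np

def preprocess(path, label):
    # Resize to the 224 x 224 input size with cubic interpolation,
    # scale pixels to [0, 1], and pair the image with its 0/1 label.
    img = cv2.imread(path)  # BGR, uint8
    img = cv2.resize(img, (224, 224), interpolation=cv2.INTER_CUBIC)
    img = img.astype(np.float32) / 255.0
    return img, label       # 0 = no crack, 1 = crack

# x, y = preprocess("crack_0001.jpg", 1)  # placeholder file name
```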

3.2 Implementation details

In this study, NASNetMobile [22], InceptionV3 [23], InceptionResNetV2 [24], MobileNet [25], MobileNetV2 [26], DenseNet-169 and -201 [27], ResNet-101 and -152 [28], ResNet-101V2 and -152V2 [29], EfficientNet-B0, -B1, -B2, -B3, -B4, and -B5 [30], ViT, and SCAMEFNet, which are state-of-the-art deep learning architectures, were used for concrete crack detection. The deep learning architectures used in this research were initialized with random parameters (e.g., weights and biases) and trained from scratch. The randomly initialized architectures were trained with the relevant data set (inputs and labels) to discover the relationships and patterns between input and output. To keep the performance of the deep learning architectures at the highest level, the parameters of the architectures were adjusted during the training phase. The TensorFlow framework [31] was used to develop and implement the DLMs, and the experimental process was carried out in the Google Colab environment. The system features a Tesla T4 GPU, 15 GB of graphics memory, 12.7 GB of RAM, and 78.2 GB of disk space. The study was carried out using the Python programming language. The hyper-parameters used in the deep learning architectures were determined empirically by trial and error. All deep learning architectures were trained under the same conditions with the training and validation data sets for 50 epochs. After training, the performance of each architecture with its optimal parameters was checked using the test data set (training-validation-test approach). During training, the categorical cross-entropy loss function, the Adam optimizer [32], the accuracy metric, and an initial learning rate of 1.0e–3 were used. Adam is a common and effective method for minimizing the cross-entropy loss via gradient-based optimization. All images were resized to 224 × 224 during the preprocessing step (resize, normalize, image interpolation). Other hyper-parameters include the batch size (32), factor (0.5), patience (5), monitor (validation loss), save best only (true), mode (min), and minimum learning rate (1.0e–4). The CCID data set, consisting of no-crack/crack images, was divided into 70% training and 30% test data, with 30% of the training set reserved for validation.

The DLMs were trained with the training and validation data sets, and after training, the performance of the models was evaluated on the test data set. In the training step of deep learning architectures, reducing the error values (loss and validation loss) to a minimum and raising accuracy and validation accuracy to a maximum are essential for a high success rate in the training/test processes. Accordingly, to obtain the minimum validation loss value during the training phase, the learning rate was reduced whenever learning stagnated: if the validation loss did not decrease for 5 epochs (patience: 5), the learning rate was multiplied by the factor parameter (0.5). This process continued down to the minimum learning rate (1.0e–4) over the 50 epochs. The change in learning rate was calculated as in Eq. (6).

The experimental process trained deep learning architectures under the same conditions for 50 epochs. TensorFlow’s ModelCheckpoint and ReduceLROnPlateau functions were used during training. Although a model was obtained at the end of each epoch for each model, only the model with the lowest validation loss value according to the save-best-only and mode parameters was saved with the ModelCheckpoint function. ReduceLROnPlateau dynamically adjusted the learning rate. Finally, the version of the model with the minimum validation loss value was evaluated on the test data set.

$$\text{new learning rate} = \text{initial learning rate} \times \text{factor}. \tag{6}$$
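In Keras, the checkpointing and learning-rate schedule described above can be sketched with the ModelCheckpoint and ReduceLROnPlateau callbacks; the hyper-parameter values mirror those reported in this section, and the checkpoint file name is an illustrative placeholder.

```python
import tensorflow as tf

callbacks = [
    tf.keras.callbacks.ReduceLROnPlateau(
        monitor="val_loss", factor=0.5, patience=5, min_lr=1.0e-4),
    tf.keras.callbacks.ModelCheckpoint(
        "best_model.keras", monitor="val_loss",
        save_best_only=True, mode="min"),
]

# Assuming `model`, `train_ds`, and `val_ds` are already defined:
# model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1.0e-3),
#               loss="categorical_crossentropy", metrics=["accuracy"])
# model.fit(train_ds, validation_data=val_ds, epochs=50, callbacks=callbacks)
```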

3.3 Evaluation metrics

In this article, accuracy (ACC), Positive Predictive Value (PPV), Sensitivity (SN), F-1 score (F-1), the Receiver Operating Characteristic-Area Under the Curve (ROC AUC), and the confusion matrix (CM) are used to perform quantitative analysis of the deep learning architectures. The CM is an important table that visualizes the performance of deep learning architectures, determined by comparing the actual and predicted class labels. The CM includes True Negative (TN), False Positive (FP), False Negative (FN), and True Positive (TP) counts. TP: an image that actually contains a crack is correctly predicted as cracked. TN: an image that contains no crack is correctly predicted as not cracked. FP: an image that contains no crack is incorrectly predicted as cracked. FN: an image that contains a crack is incorrectly predicted as not cracked. The mathematical expressions of the metrics used in the article are given in Eqs. (7)–(10). Among the evaluation metrics, "ACC" is the ratio of the number of correctly predicted samples to the total number of samples; "PPV" is the ratio of the number of correctly predicted positive samples to the number of samples predicted as positive; "SN" is the ratio of the number of correctly predicted positive samples to the total of true positive and false negative samples; and "F-1" is the harmonic mean of the PPV and SN metrics.

$$\mathrm{ACC} = \frac{TP + TN}{TP + TN + FP + FN}, \tag{7}$$

$$\mathrm{PPV} = \frac{TP}{TP + FP}, \tag{8}$$

$$\mathrm{SN} = \frac{TP}{TP + FN}, \tag{9}$$

$$\mathrm{F1} = \frac{2 \times \mathrm{PPV} \times \mathrm{SN}}{\mathrm{PPV} + \mathrm{SN}}. \tag{10}$$
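A small sketch computing these four metrics from a confusion matrix, assuming scikit-learn and toy labels, is given below.

```python
from sklearn.metrics import confusion_matrix

def report(y_true, y_pred):
    # Compute ACC, PPV, SN, and F-1 following Eqs. (7)-(10).
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    acc = (tp + tn) / (tp + tn + fp + fn)
    ppv = tp / (tp + fp)
    sn = tp / (tp + fn)
    f1 = 2 * ppv * sn / (ppv + sn)
    return acc, ppv, sn, f1

print(report([1, 0, 1, 1, 0, 1], [1, 0, 0, 1, 1, 1]))
# approximately (0.667, 0.75, 0.75, 0.75)
```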

3.4 Results and analysis

This study has comprehensively evaluated the performance of the proposed methodologies with modern deep learning algorithms in detecting cracks in concrete structures in the CCID data set. This section presents the experimental results and their analysis in detail. The proposed methods and modern deep learning algorithms have been comparatively examined under the same conditions in the evaluation process. The experimental findings reveal the advantages of the proposed methodologies in terms of performance, the effectiveness of deep learning-based approaches, and the limitations of these methods.

3.4.1 Experimental results of the proposed deep learning models

This part evaluates the performance of the proposed deep learning architectures in detecting cracks in concrete structures. It also examines the effect of the CSCM module, one of the main components of the SCAMEFNet model, on model performance, as well as the effect of the number of transformer encoder layers in the ViT model on classification performance. The findings are presented in Tab.3. Evaluation metrics such as ACC, PPV, SN, and F-1 were used to analyze the methods' performance. In addition, to compare the computational and time complexity of the methods, structural features of the models such as the average epoch time, the total number of parameters, and the total number of layers were also examined. These findings provide a comprehensive evaluation of the models' accuracy and computational complexity. While the SCAMEFNet model offers a very deep structure with 1709 layers and a sophisticated architecture with 6.97 million parameters, the ViT model, despite having only 224 layers, carries a higher parameter complexity with 11.90 million parameters.

In the evaluation made with the SCAMEFNet model, the most successful result was obtained with a 98.83% ACC metric value. The average epoch time of the SCAMEFNet model during the training period is 148.60 s. In the SCAMEFNet model, the cases where CSCM modules were removed caused a decrease in the classification performance compared to the original model. According to the experimental findings, an accuracy decrease of 0.83% was observed when the CSCM-1 block was removed from the model, 1.16% when CSCM-1 and CSCM-2 were removed together, and finally, 1.16% when the CSCM-1, CSCM-2, and CSCM-3 modules were removed. These results show that the CSCM block developed by integrating CNN and MHA increases model accuracy. When the effect of changes in the number of transformer encoder layers of the ViT model on model accuracy is examined, the ViT-B16 model produced the most successful result, obtaining a 97.33% ACC value with 16 encoder layers. The average epoch time of the ViT model during the training period is 4.88 s. The accuracy showed an increasing trend when the number of encoder layers was 4, 8, or 16. However, a decrease in accuracy was detected when the number of layers was increased from 16 to 32. Fig.10 shows the comparative scattering diagram of the ACC, PPV, SN, and F-1 evaluation metrics together with the validation loss, validation accuracy, and ROC curve graphs of the SCAMEFNet and ViT models.

Fig.10(a) shows the change in validation loss values of the SCAMEFNet and ViT models. The graph shows that significant fluctuations in validation loss were observed in both models at the beginning of the training process, indicating that the models had not yet stabilized in the early stages. However, as the training progressed, these fluctuations decreased, and the models entered the convergence process. The graph shows that the SCAMEFNet model produced lower validation loss values compared to the ViT models, especially from the 20th cycle onwards. According to the experimental results, the ViT-B8 model produced the highest validation loss value, while the SCAMEFNet wo (CSCM-1, CSCM-2, CSCM-3) model obtained the lowest value. An additional graph is presented to visualize the validation loss changes of the models over 40–50 iterations more clearly. In this detailed analysis, the ViT-B8 model produced the highest validation loss value, and the SCAMEFNet wo (CSCM-1, CSCM-2) model produced the lowest. After training the DLMs, the testing process was carried out using the best model with the lowest validation loss value obtained (the optimal model). According to the findings, the SCAMEFNet model obtained its optimal parameters in the 28th cycle, producing 0.1045 validation loss and 0.9786 validation accuracy values in the 23rd cycle. The ViT model obtained its optimal parameters in the 11th cycle, producing 0.2140 validation loss and 0.9572 validation accuracy values. Fig.10(b) shows the changes in validation accuracy values of the SCAMEFNet and ViT models. According to the graph, the SCAMEFNet model experienced more significant changes in validation accuracy values than the ViT model. These changes decreased in the SCAMEFNet model starting from the 29th cycle, and higher validation accuracy values were generally obtained compared to the ViT model. According to the findings, the lowest validation accuracy value was obtained from the ViT-B32 model with 0.5059, while the highest was produced by the SCAMEFNet and SCAMEFNet wo (CSCM-1, CSCM-2) models with 0.9810. Fig.10(c) shows the AUC results of the SCAMEFNet and ViT models. The SCAMEFNet model achieved the highest success based on the experimental results, demonstrating an AUC value of 0.9882. The model with the lowest performance was the ViT-B4 model, with an AUC value of 0.9614. The average AUC value of all models was calculated as 0.9736. Fig.10(d) shows the ACC, PPV, SN, and F-1 evaluation metric results of the SCAMEFNet and ViT models. According to this graph, SCAMEFNet was the most successful model in terms of the ACC, SN, and F-1 metrics. The SCAMEFNet wo (CSCM-1, CSCM-2, CSCM-3) was determined to be the most successful model in terms of the PPV metric.

3.4.2 Comparative evaluation of deep learning models

This study thoroughly assessed the performance of the SCAMEFNet and ViT models using the CCID data set. To provide an objective evaluation of these models' effectiveness in crack detection, we compared their experimental results with those of 17 other modern DLMs based on CNNs, all tested on the same data set. This in-depth evaluation highlights the strengths and drawbacks of the SCAMEFNet and ViT models compared to current approaches. Tab.4 gives the experimental results of the proposed and modern DLMs. The relevant table also includes structural features such as average epoch times, the number of parameters, and the number of layers to compare the DLMs' computational and time complexities. According to the findings, SCAMEFNet had the highest performance, with a 98.83% ACC rate, while the MobileNetV2 model had the lowest performance, with a 51.33% ACC rate. In addition, the SCAMEFNet model produced the most successful results regarding the SN and F-1 metrics. The second proposed model, ViT, succeeded with a 97.33% ACC rate. In the temporal complexity analysis of the deep learning algorithms, the SCAMEFNet model had the highest computational time, 148.60 s, while the second proposed method, the ViT model, showed the lowest computational time, 4.88 s. When evaluated in terms of model architecture, although SCAMEFNet is the deepest model with 1709 layers, it showed effective parameter efficiency with only 6.97 million parameters. This indicates that the model successfully balances computational performance, depth, and parameter efficiency and offers a more efficient architecture than modern, advanced models. Among the DLMs, ResNet152 had the highest total parameters, with 58.38 million parameters, while the MobileNet architecture had the lowest, with 3.23 million.

The second proposed model, the ViT architecture, shows a high complexity level with 11.90 million parameters despite having 224 total layers. When the ViT model is compared to the closest CNN-based model in terms of the total number of layers, EfficientNetB0 with 240 layers, it has a higher parameter count with 11.90 million parameters. This shows that transformer-based models are more computationally complex than CNN-based models. However, the SCAMEFNet model, although it incorporates transformer-based components (MHA), is structured very effectively in terms of computational efficiency. The closest model to SCAMEFNet in terms of the number of layers is InceptionResNetV2, which has a total of 782 layers. The InceptionResNetV2 model produced an ACC rate of 97.67% with 54.34 million parameters. In contrast, SCAMEFNet achieved 98.83% classification accuracy with only 6.97 million parameters despite its 1709-layer deep structure. These findings show that SCAMEFNet performs well in terms of parameter efficiency and generalization capacity. Fig.11 shows the validation loss, validation accuracy, and ROC curve graphs of the DLMs, along with the comparative scattering diagram of the ACC, PPV, SN, and F-1 evaluation metrics. Fig.12 shows the CM results of the proposed SCAMEFNet, InceptionV3, DenseNet169, and ResNet101V2 DLMs on the CCID data set's test data.

Fig.11(a) gives the changes in validation loss values of SCAMEFNet, InceptionV3, DenseNet169, ResNet101V2, MobileNet, EfficientNetB2, and NASNetMobile DLMs during the training process. In the graph, the validation loss changes of the other models could not be observed clearly due to the high validation loss value of the ResNet101V2 model. An additional graph is provided to clearly visualize the validation loss changes of the models between 40 and 50 iterations. In the detailed analysis, ResNet101V2 produced the lowest validation loss value, and EfficientNetB2 produced the highest. According to the findings, the optimal models with the lowest validation loss values used in the testing process of SCAMEFNet, InceptionV3, DenseNet169, ResNet101V2, MobileNet, EfficientNetB2, and NASNetMobile DLMs were obtained in 28, 32, 26, 31, 40, 40, and 33 cycles, respectively. Fig.11(b) shows the change in validation accuracy values of DLMs during the training process. When the graph is examined, decreasing and increasing trends in validation accuracy values are observed throughout the training process. In the last ten epochs, the fluctuation in validation accuracy of SCAMEFNet, InceptionV3, DenseNet169, and ResNet101V2 models decreased, indicating that the models exhibited more stable learning and improved generalization performance toward the end of training. Fig.11(c) shows the AUC results of DLMs. According to the experimental findings, the SCAMEFNet model was determined to be the most successful, with an AUC value of 0.9882; the model with the lowest performance was NASNetMobile, with an AUC value of 0.8035. The average AUC value of all models was calculated as 0.9482. Fig.11(d) shows the graph of DLMs’ ACC, PPV, SN, and F-1 evaluation metric results. According to this graph, SCAMEFNet was the most successful model in terms of ACC, SN, and F-1 metrics. Regarding the PPV metric, the most successful models were determined to be InceptionV3 and ResNet101V2.

Fig.12(a) presents the CM results of the SCAMEFNet model. This model misclassified only 7 out of 600 test cases. Fig.12(b) presents the results of the InceptionV3 model, which misclassified 10 out of 600 test cases on the same data set. Fig.12(c) presents the results of the DenseNet169 model. This model misclassified 12 out of 600 test cases. Fig.12(d) presents the results of the ResNet101V2 model, which similarly misclassified 12 out of 600 test cases. The SCAMEFNet, InceptionV3, DenseNet169, and ResNet101V2 models had the most difficulty classifying the “Crack” class cases.

3.4.3 Experimental results of the proposed DFTSEM method

This section provides a detailed assessment of the effectiveness of the proposed DFTSEM method in identifying cracks in concrete structures; the experimental results are given in Tab.5. The DFTSEM method consists of two stages. In the first stage, feature extraction was performed with the SCAMEFNet and ViT models, and the resulting feature vectors, of dimension 780 and 4608, respectively, were classified using the AdaBoost, RF, LR, and HGB classifiers. The experiments combining SCAMEFNet features with these machine learning algorithms reached an ACC of 98.83%, while the best result with ViT features, an ACC of 96.83%, was obtained with the AdaBoost algorithm. In the second stage, classification was performed with a Voting Classifier built from the AdaBoost, RF, LR, and HGB classifiers. According to the experimental findings, the ViT + Ensemble Learning method achieved an ACC of 95.83%, whereas the SCAMEFNet + Ensemble Learning method achieved an ACC of 99.00%, the highest classification performance in this study. Fig.13 shows the CMs summarizing the proposed DFTSEM method's performance on the test data of the CCID data set.
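A minimal sketch of the second stage, assuming scikit-learn implementations of the four classifiers, is given below. The random feature arrays stand in for the real deep feature vectors, and the hyperparameters and soft-voting scheme are illustrative assumptions rather than the authors' exact settings.

```python
# Hedged sketch of DFTSEM stage 2: a voting ensemble over AdaBoost, RF, LR, and
# HGB classifiers fitted on deep feature vectors. Hyperparameters are illustrative.
import numpy as np
from sklearn.ensemble import (AdaBoostClassifier, HistGradientBoostingClassifier,
                              RandomForestClassifier, VotingClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Stand-in random data; in DFTSEM these would be the 780-dim SCAMEFNet
# (or 4608-dim ViT) feature vectors with binary crack / no-crack labels.
rng = np.random.default_rng(0)
X_train, y_train = rng.normal(size=(2400, 780)), rng.integers(0, 2, size=2400)
X_test, y_test = rng.normal(size=(600, 780)), rng.integers(0, 2, size=600)

ensemble = VotingClassifier(
    estimators=[
        ("ada", AdaBoostClassifier(n_estimators=100, random_state=42)),
        ("rf", RandomForestClassifier(n_estimators=200, random_state=42)),
        ("lr", LogisticRegression(max_iter=1000)),
        ("hgb", HistGradientBoostingClassifier(random_state=42)),
    ],
    voting="soft",  # average predicted class probabilities over the four models
)
ensemble.fit(X_train, y_train)
print(f"Ensemble ACC: {accuracy_score(y_test, ensemble.predict(X_test)):.4f}")
```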

Fig.13(a) presents the CM results of the SCAMEFNet + Ensemble Learning method, which misclassified only 6 out of 600 test samples. Fig.13(b) presents the results of the ViT + Ensemble Learning method, which misclassified 25 of the same 600 test samples. The SCAMEFNet + Ensemble Learning method had the most difficulty classifying the “Crack” samples, while the ViT + Ensemble Learning method had the most difficulty classifying the “Without Crack” samples.

4 Conclusions

This study proposes innovative approaches for detecting cracks in concrete structures: the SCAMEFNet model, the ViT model, and the DFTSEM method. The study used the Concrete Cracks Image Data set, a new data set of crack images collected from the concrete surfaces of public buildings. SCAMEFNet was developed by integrating CNN and MHA mechanisms and applying fusion techniques. With its 1709-layer architecture, the model is significantly deeper than traditional DLMs; despite this complexity, it is not affected by common problems such as overfitting, underfitting, or vanishing gradients. Containing 6.97 million parameters, SCAMEFNet establishes an optimal balance between computational efficiency and high performance, and its effectiveness is verified with validation loss and validation accuracy curves as well as test results. In addition, the study examined the transformer-based modified ViT algorithm, which generally achieved competitive results compared with traditional CNN architectures. The two-stage ensemble learning model DFTSEM (SCAMEFNet + Ensemble Learning) achieved the highest classification accuracy (99.00%), while SCAMEFNet alone outperformed 17 other state-of-the-art deep learning algorithms with an accuracy of 98.83%. The experimental findings highlight SCAMEFNet's successful performance in concrete crack detection and its contribution to the field of deep learning, and they demonstrate the effectiveness of the DFTSEM method in crack detection.

In future studies, the proposed methods can be tested in practical applications by integrating them into real-world systems such as smartphones, tablets, and UAVs. Integrating RNN techniques such as LSTM and GRU, along with dilated convolutions, into the SCAMEFNet model has the potential to enhance its performance, and pruning techniques can reduce the model's size and increase its computational efficiency. Binary and multi-class classification tasks can be performed on different data sets to test the generalizability of the proposed methods. The number of images in the CCID data set can be increased, or new categories can be defined according to crack severity, allowing the models' effectiveness in crack detection to be tested in greater depth. Finally, the proposed SCAMEFNet model was trained and tested from scratch; to increase its classification performance, it could be pre-trained on ImageNet within a transfer-learning framework and then fine-tuned for crack classification.
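As a hedged sketch of the transfer-learning variant suggested above, the snippet below initializes an ImageNet-pretrained backbone and attaches a fresh binary head; EfficientNetB0 is used purely as a stand-in, since SCAMEFNet has no published pretrained weights, and all settings are illustrative assumptions.

```python
# Hypothetical transfer-learning setup: ImageNet-pretrained backbone
# (EfficientNetB0 as a stand-in for SCAMEFNet, which has no public weights)
# with a new binary head for crack / no-crack classification.
import tensorflow as tf

base = tf.keras.applications.EfficientNetB0(
    include_top=False, weights="imagenet",
    input_shape=(224, 224, 3), pooling="avg")
base.trainable = False  # freeze pretrained features first; unfreeze later to fine-tune
output = tf.keras.layers.Dense(1, activation="sigmoid")(base.output)
model = tf.keras.Model(base.input, output)
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
```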

The proposed approach can produce successful results in various disciplines. Amid today's rapid urbanization, the study can provide rapid assessment data for countries where high-rise (plaza-type) settlement and concrete construction are widespread. It also offers an economical solution for the health monitoring and management of concrete structures, which is important for countries pursuing smart-city transformation and sustainability. The study can thus make significant contributions to managers and owners by supporting the integration of reinforced concrete structures, especially public buildings, into smart-city BIM platforms.

References

[1] Spencer B F Jr, Hoskere V, Narazaki Y. Advances in computer vision-based civil infrastructure inspection and monitoring. Engineering, 2019, 5(2): 199–222
[2] Andrushia A D, Anand N, Neebha T M, Naser M Z, Lubloy E. Autonomous detection of concrete damage under fire conditions. Automation in Construction, 2022, 140: 104364
[3] Han X, Zhao Z, Chen L, Hu X, Tian Y, Zhai C, Wang L, Huang X. Structural damage-causing concrete cracking detection based on a deep-learning method. Construction & Building Materials, 2022, 337: 127562
[4] Asadi Shamsabadi E, Xu C, Rao A S, Nguyen T, Ngo T, Dias-da-Costa D. Vision transformer-based autonomous crack detection on asphalt and concrete surfaces. Automation in Construction, 2022, 140: 104316
[5] Rao A S, Nguyen T, Palaniswami M, Ngo T. Vision-based automated crack detection using convolutional neural networks for condition assessment of infrastructure. Structural Health Monitoring, 2021, 20(4): 2124–2142
[6] Liu F, Liu J, Wang L. Deep learning and infrared thermography for asphalt pavement crack severity classification. Automation in Construction, 2022, 140: 104383
[7] Katsigiannis S, Seyedzadeh S, Agapiou A, Ramzan N. Deep learning for crack detection on masonry façades using limited data and transfer learning. Journal of Building Engineering, 2023, 76: 107105
[8] Russel N S, Selvaraj A. MultiScaleCrackNet: A parallel multiscale deep CNN architecture for concrete crack classification. Expert Systems with Applications, 2024, 249: 123658
[9] Nyathi M A, Bai J, Wilson I D. Deep learning for concrete crack detection and measurement. Metrology, 2024, 4(1): 66–81
[10] Shashidhar R, Manjunath D, Shanmukha S M. CrackSpot: Deep learning for automated detection of structural cracks in concrete infrastructure. Asian Journal of Civil Engineering, 2024, 25(1): 1079–1090
[11] Bhalaji Kharthik K S, Onyema E M, Mallik S, Siva Prasad B V V, Qin H, Selvi C, Sikha O K. Transfer learned deep feature based crack detection using support vector machine: A comparative study. Scientific Reports, 2024, 14(1): 14517
[12] Rashid T, Mokji M M, Rasheed M. Cracked concrete surface classification in low-resolution images using a convolutional neural network. Journal of Optics, 2024
[13] Rabczuk T, Ren H, Zhuang X. A nonlocal operator method for partial differential equations with application to electromagnetic waveguide problem. Computers, Materials & Continua, 2019, 59(1): 31–55
[14] Samaniego E, Anitescu C, Goswami S, Nguyen-Thanh V M, Guo H, Hamdia K, Zhuang X, Rabczuk T. An energy approach to the solution of partial differential equations in computational mechanics via machine learning: Concepts, implementation and applications. Computer Methods in Applied Mechanics and Engineering, 2020, 362: 112790
[15] Hong G, Chen X, Chen J, Zhang M, Ren Y, Zhang X. A multi-scale gated multi-head attention depthwise separable CNN model for recognizing COVID-19. Scientific Reports, 2021, 11(1): 18048
[16] Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez A N, Kaiser L, Polosukhin I. Attention is all you need. In: Advances in Neural Information Processing Systems 30 (NIPS 2017). Long Beach, CA: NIPS, 2017, 5998–6008
[17] Cheng S, Liu Y. Research on transportation mode recognition based on multi-head attention temporal convolutional network. Sensors, 2023, 23(7): 3585
[18] Dosovitskiy A, Beyer L, Kolesnikov A, Weissenborn D, Zhai X, Unterthiner T, Dehghani M, Minderer M, Heigold G, Gelly S, Uszkoreit J, Houlsby N. An image is worth 16 × 16 words: Transformers for image recognition at scale. 2020, arXiv: 2010.11929
[19] Azad R, Kazerouni A, Heidari M, Aghdam E K, Molaei A, Jia Y, Jose A, Roy R, Merhof D. Advances in medical image analysis with vision transformers: A comprehensive review. Medical Image Analysis, 2024, 91: 103000
[20] Hou S, Liu Y, Yang Q. Real-time prediction of rock mass classification based on TBM operation big data and stacking technique of ensemble learning. Journal of Rock Mechanics and Geotechnical Engineering, 2022, 14(1): 123–143
[21] Reis H C, Turk V, Bozkurt M F, Yigit S N. Concrete Cracks Image Dataset (CCID). 2024 (available at the website of Mendeley Data, v1)
[22] Zoph B, Vasudevan V, Shlens J, Le Q V. Learning transferable architectures for scalable image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Salt Lake City, UT: IEEE/CVF, 2018, 8697–8710
[23] Szegedy C, Vanhoucke V, Ioffe S, Shlens J, Wojna Z. Rethinking the inception architecture for computer vision. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Las Vegas, NV: IEEE/CVF, 2016, 2818–2826
[24] Szegedy C, Ioffe S, Vanhoucke V, Alemi A. Inception-v4, Inception-ResNet and the impact of residual connections on learning. In: Proceedings of the AAAI Conference on Artificial Intelligence. San Francisco, CA: AAAI Press, 2017, 31(1): 4278–4284
[25] Howard A G, Zhu M, Chen B, Kalenichenko D, Wang W, Weyand T, Andreetto M, Adam H. MobileNets: Efficient convolutional neural networks for mobile vision applications. 2017, arXiv: 1704.04861v1
[26] Sandler M, Howard A, Zhu M, Zhmoginov A, Chen L C. MobileNetV2: Inverted residuals and linear bottlenecks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Salt Lake City, UT: IEEE/CVF, 2018, 4510–4520
[27] Huang G, Liu Z, van der Maaten L, Weinberger K Q. Densely connected convolutional networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Honolulu, HI: IEEE/CVF, 2017, 4700–4708
[28] He K, Zhang X, Ren S, Sun J. Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Las Vegas, NV: IEEE/CVF, 2016, 770–778
[29] He K, Zhang X, Ren S, Sun J. Identity mappings in deep residual networks. In: Computer Vision–ECCV 2016. Cham: Springer International Publishing, 2016, 630–645
[30] Tan M, Le Q. EfficientNet: Rethinking model scaling for convolutional neural networks. 2019, arXiv: 1905.11946
[31] Abadi M, Barham P, Chen J, Chen Z, Davis A, Dean J, Devin M, Ghemawat S, Irving G, Isard M, et al. TensorFlow: A system for large-scale machine learning. 2016, arXiv: 1605.08695
[32] Kingma D P, Ba J. Adam: A method for stochastic optimization. 2014, arXiv: 1412.6980
