1 Introduction
Steel is widely used in engineering fields such as civil, aerospace, and nuclear engineering [1–3], and is important within the context of life-cycle benefit estimation [4]. Steel members are usually connected by welds, bolts, and rivets. Compared with bolted and riveted solutions, welding has unparalleled advantages, such as geometric flexibility, high stiffness, and full usage of the steel base material. Nonetheless, the mechanical properties of welded connections can be degraded by weld defects. The commonly observed types of weld defect include crack, lack of fusion (LF), porosity (P), and slag inclusion (SI), examples of which are illustrated in Fig.1 [5,6].
Therefore, detection of steel weld defects is an important step in quality control and structural health monitoring procedures. Many approaches have been developed to detect steel weld defects. For example, radiographic and X-ray images have been used to classify weld defects; these image data can be processed [7–10] to achieve high classification accuracy and versatility. Automatic weld detection algorithms, such as the autonomous method based on the multi-class support vector machine (SVM) applied to X-ray images [9], have also been developed. Another available technique is the magneto-optical imaging-based approach, with studies focusing on processing methods [10] and effectiveness [11]. Furthermore, the infrared thermography-based approach [12] and wave-based detection (e.g., using guided waves [13] or ultrasonic waves [14]) are also trending approaches.
It is noteworthy that, compared with the aforementioned approaches that rely on various data collection equipment, methods based on sound signals, which are less dependent on equipment, have demonstrated their effectiveness in multiple fields. In developing this technology, multiple works were first carried out to investigate the features associated with environmental sounds (e.g., sirens, street music, gunshots, dog barking, and children playing) [15], animal sounds [16], heartbeat sounds [17], snoring sounds [18], and engine vibration sound signals [19].
In addition, percussion sounds can be used to characterize or diagnose the internal properties of an object. For example, percussion methods have been used in the medical field for diagnosis for many decades [20,21]. In the field of civil engineering, the percussion method has been implemented to detect cup-lock scaffold looseness [22], to detect subsurface voids in concrete-filled steel tubular structures [23], to predict the shear load capacity of steel bolts [24], and to reconstruct sound to identify bolt looseness [25].
Together with these sound-based approaches, machine learning (ML) algorithms have been applied to address engineering challenges [26–29]. Examples include applications of random forest (RF) [30], k-nearest neighbors (KNN), SVM [23], and the hidden Markov model (HMM) [31]. Recently, artificial neural networks (ANNs), represented by the convolutional neural network (CNN) [21,22,25] and the bi-directional long short-term memory (LSTM) network [24], have been applied to percussion-based classification tasks to improve classifier accuracy.
Following the overview of the literature presented above, there are still important topics that need to be further discussed and investigated. Three such issues are listed as follows. 1) Existing weld defect detection methods may require costly experimental instruments and skilled operating personnel. 2) Few existing weld defect detection methods can identify subsurface weld defects. 3) Although smart algorithms have been proposed to support percussion-based approaches, little experimental work has been done to evaluate the accuracies of the implemented algorithms.
To address these problems, this paper proposes an automated steel weld defect classification methodology (referred to as AWDC hereafter) to intelligently classify four kinds of weld statuses from a feature data set processed from percussion sounds, using 7 classifiers: 4 common ML algorithms (RF, KNN, SVM, and HMM) and 3 CNN-enhanced algorithms (CNN-enhanced RF, CNN-enhanced KNN, and CNN-enhanced SVM). The 4 ML algorithms were used to conduct the classification tasks, and the CNN was used to improve the classification performance of RF, KNN, and SVM. The effectiveness and accuracy of the proposed method were investigated through a set of carefully designed experiments, and the results were analyzed to identify the better classifiers for identification of weld statuses.
2 Methodology
The AWDC methodology aims at two consecutive goals: 1) judging whether a steel weld has defect(s); and 2) where weld defects exist, classifying the type(s) of defect. This section provides an overview of the proposed methodology, as illustrated in Fig.2. The methodology consists of three key steps: 1) raw data collection, 2) data set preprocessing, and 3) results and verification.
2.1 Raw data collection
As the proposed methodology is rooted in ML algorithms that require training, the first step, “raw data collection,” is to collect the raw signal records. Handy tools such as hammers and metal beads can be used to generate the percussion sounds, and smartphones can be used to record the raw sound signals.
Meanwhile, in order to train the classifiers, a group of welded steel specimens with known weld statuses is needed. For example, some of these specimens can be sampled and collected from the steel welds of real projects. The presented approach can be more accurate for smaller and more specific problems; for example, if all the specimens in the problem were L-shaped steel angles, or if all the specimens were made of a single type of steel (say, high-strength steel welded joints), the approach could be more accurate.
Automated devices such as automatic percussing machines and accompanying recording devices could also be used when collecting data from specimens gathered in the field. When data are collected from real projects, the weld defects for the training data set can be identified using the other techniques introduced in the Introduction of this paper. Among the various defects, LF is the hardest to identify because imaging-based approaches do not work well in such cases.
The welded steel specimens, as well as the recorded percussion data, need to be sorted according to the different weld statuses. Moreover, considering the accuracy of the methodology, it is desirable to collect as many raw signals as possible; at least thousands of raw signals are required. Meanwhile, for the same weld, different positions can be percussed several times to obtain multiple recordings. For example, when a 2-cm porosity is discovered, multiple records can be collected by percussing different locations along the defect. Furthermore, when the number of welded steel specimens from the field is insufficient, similar forms of steel members with deliberately added weld defects can be manufactured to increase the size of the training data set.
Two more points should be added to fully describe the raw data collection step and to improve the generalization capacity of the trained network. First, the percussion strength and the percussion distance can differ among records; in other words, they are not required to be exactly the same during all measurements. In addition to manual percussion, automated percussion machines that conduct pre-defined percussion actions under different strengths and distances can be used. Second, background environmental noise is unavoidable and is frequently recorded; that is, the data recording process can be carried out in different locations. The raw signals may contain low-level environmental noises such as walking sounds, human voices, and machine noises. The records are considered valid when the peak volume of these environmental noises is clearly weaker than that of the percussion sound.
2.2 Data set preprocessing
The second step is the “data set preprocessing” step. This is an intermediate step whose task is to process the raw signals and to generate a data set with predefined data structures that are compatible with the classifiers involved in the subsequent procedures.
The first procedure is to truncate the raw signals to equal lengths that cover the full range of all the percussion sounds. In practice, the truncated data clips are usually shorter than one second. It is further encouraged to expand the records based on the truncated data clips, since expanding the database can also improve the generalization of the classifiers. Multiple approaches (e.g., adding random Gaussian noise, time stretching, and time shifting) can be used to generate new data signals.
The second procedure is to extract features from the data signals. A few methods are worth mentioning. The Mel frequency cepstral coefficients (MFCC) method was originally designed to recognize the human voice. Later on, the continuous wavelet transform (CWT) and the short-time Fourier transform (STFT) were developed for more general non-stationary signal processing. The highlights of these extraction procedures are presented in Appendix A in the Electronic Supplementary Material.
2.3 Results and verification
The final step is designed to verify the results of the AWDC methodology. To evaluate the performance of a classifier on new signals, approaches such as hold-out, cross validation, and leave-one-out can be used to divide the processed data set into a training set and a testing set [32]. In addition, stratified sampling can be used while dividing the data set to ensure that the training and testing sets follow independent and identical distributions. Many approaches can serve as classifiers, such as ML algorithms (e.g., RF, KNN, SVM, and HMM) and ANN algorithms (e.g., CNN and LSTM). To obtain more accurate classification results, this paper recommends the CNN-enhanced approaches to process the extracted features, a conclusion drawn from the subsequent sections.
First, the selected classifiers are trained on the training set. The hyperparameters of the algorithms are usually adjusted over several trials to improve the training accuracy. Then, once the classifiers are well trained, results are obtained on the testing set. At this stage, the performance of the selected classifiers can be compared. The evaluation metrics can be accuracy, recall, precision [18], computational effort, etc.
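As a concrete illustration, these metrics can be computed with standard tooling. The following is a minimal sketch assuming scikit-learn is available; the label lists are dummy placeholders for true and predicted weld statuses, not data from the study.

```python
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             precision_score, recall_score)

# Dummy placeholders for true and predicted weld statuses (Q, LF, P, SI).
y_test = ["Q", "LF", "P", "SI", "Q", "P"]
y_pred = ["Q", "LF", "P", "SI", "LF", "P"]

acc = accuracy_score(y_test, y_pred)                      # overall accuracy
prec = precision_score(y_test, y_pred, average="macro")   # macro-averaged precision
rec = recall_score(y_test, y_pred, average="macro")       # macro-averaged recall
cm = confusion_matrix(y_test, y_pred)                     # basis of confusion-matrix plots
```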
3 Experimental evaluation of automated steel weld defect classification methodology
Experiments were designed to validate the effectiveness of the proposed method. As a proof-of-concept study, the targeted steel weld specimens are described as follows.
1) All the tested specimens were directly manufactured in the workshop with deliberately-added weld defects. That is, the specimens were not sampled from a specific project.
2) The commonly seen steel angles and steel T’s were used as the specimen types. For simplicity, the impact of different steel materials was not evaluated, and the commonly seen carbon steel was used to make the specimens.
3) Three kinds of weld defects (i.e., LF, P, and SI) were deliberately added to the specimens during the manufacturing process. All the defects were distributed over the full length of the specimens.
4) Electrode arc welding was used. When deliberately producing the defects, porosities were controlled by the distance between the electrode and the specimen, SIs were created by sprinkling slag during the welding process, and LF defects were produced by the welder performing fast welding to hollow the weld seams.
To summarize, Tab.1 lists the features of each specimen, including the shape, type, and dimensions of each specimen, as well as the number of specimens for each type of weld status. Correspondingly, photos of representative specimens are presented in Fig.3, where the sub-figures show the weld details.
3.1 Raw data collection
In this study, a steel sphere (10-mm diameter) with a 3-mm hollow cylinder handle was used to percuss the specimens. A smartphone was placed on the same side of the specimen as the steel sphere to record the raw sound signals, as shown in Fig.4. The percussions were conducted manually. Additionally, the recording software provided a visual display of the volume, which allowed the operator to check whether the percussion sound was the loudest event in the recording. If the volume peaked during the percussion, the recording was saved.
The experiment was carried out in a classroom environment with light background noise. Most of the collected signal records lasted about one second, ranging between 0.817 and 2.496 s. Afterwards, the collected signals were labeled (i.e., Q, LF, P, and SI) for the subsequent training procedures.
3.2 Data set preprocessing
To process the collected data, the raw recordings in the “.m4a” format were first converted into “.wav” files by the pydub class “AudioSegment” [33] in Python 3.7. For each raw signal, the first step was to locate its peak volume and the corresponding time-frame index. Then, the start point was set at 0.1 s before the frame of the peak volume, and the end point was set at 0.4 s after it; a 0.5 s time frame was therefore selected, which was adequate to cover the durations of all the percussion sounds. Each truncated percussion signal was then saved as a “.wav” file with a 48 kHz sampling rate.
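A minimal sketch of this truncation step is given below. It assumes pydub and numpy are used (the paper only names the pydub “AudioSegment” class); the function name and file names are illustrative, not the authors’ implementation.

```python
import numpy as np
from pydub import AudioSegment

def truncate_percussion(m4a_path, out_path, pre=0.1, post=0.4, sr=48000):
    """Cut a 0.5 s clip around the loudest sample and save it as a .wav file."""
    audio = AudioSegment.from_file(m4a_path, format="m4a")
    audio = audio.set_frame_rate(sr).set_channels(1)        # mono, 48 kHz
    samples = np.array(audio.get_array_of_samples())
    peak_ms = int(np.argmax(np.abs(samples)) / sr * 1000)   # peak position in ms
    start_ms = max(peak_ms - int(pre * 1000), 0)            # 0.1 s before the peak
    clip = audio[start_ms:start_ms + int((pre + post) * 1000)]
    clip.export(out_path, format="wav")

truncate_percussion("record_001.m4a", "record_001.wav")
```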
After the raw data signals were collected and truncated, two approaches (adding Gaussian noise and time shifting) were used to expand the data set. Gaussian noise was added according to the signal-to-noise ratio (SNR) [22]; smaller SNR values indicate stronger noise in the data signals. The truncated original data set was then expanded with 9 additional data sets of the same size, whose features were: 1) no shifting with SNR = 50; 2) no shifting with SNR = 35; 3) no shifting with SNR = 20; 4) 0.1 s shifting with no noise; 5) 0.2 s shifting with no noise; 6) 0.3 s shifting with no noise; 7) 0.1 s shifting with SNR = 50; 8) 0.1 s shifting with SNR = 35; 9) 0.1 s shifting with SNR = 20. The processing methods and examples of the expanded signals are depicted in Fig.5.
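The two expansion operations can be sketched as follows, assuming the truncated clip is loaded as a 1-D numpy array at 48 kHz (here with librosa); the function names and loaded file are illustrative, not the authors’ implementation.

```python
import librosa
import numpy as np

def add_gaussian_noise(signal, snr_db):
    """Add white Gaussian noise at a prescribed signal-to-noise ratio (in dB)."""
    signal_power = np.mean(signal ** 2)
    noise_power = signal_power / (10.0 ** (snr_db / 10.0))
    noise = np.random.normal(0.0, np.sqrt(noise_power), size=signal.shape)
    return signal + noise

def time_shift(signal, shift_s, sr=48000):
    """Delay the clip by shift_s seconds, padding the front with zeros."""
    n = int(shift_s * sr)
    return np.concatenate([np.zeros(n), signal[:-n]]) if n > 0 else signal

clip, sr = librosa.load("record_001.wav", sr=48000)    # truncated 0.5 s clip
# e.g., the "0.1 s shifting with SNR = 35" variant of the expanded data set
augmented = add_gaussian_noise(time_shift(clip, 0.1, sr), 35)
```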
Afterwards, the data files were further processed to extract their features. Specifically, the MFCC, CWT, and STFT features were employed. When extracting the MFCC features, a sliding window frame was designed to conduct frame blocking, as described in Appendix A in the Electronic Supplementary Material. Specifically, as shown in Fig.6, the sliding window frame had 2048 sample points, and the stride length covered 512 sample points. Each truncated signal included 0.5 × 48000 = 24000 sample points; the figure only illustrates a part of these points (i.e., from 1024 to 19200) for demonstration purposes. Referring to Appendix A in the Electronic Supplementary Material, in this case, each truncated signal had 47 window frames and each window frame had 120 MFCC features. The 120 MFCC features comprised 40 original coefficients, 40 first-derivative coefficients, and 40 second-derivative coefficients. All MFCC features were normalized and used as inputs to the 7 different classifiers.
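A sketch of this MFCC extraction is shown below, assuming librosa (the paper does not name the feature extraction library); the window length, stride, and coefficient counts follow the values given in the text, and the file name is illustrative.

```python
import librosa
import numpy as np

y, sr = librosa.load("record_001.wav", sr=48000)            # 0.5 s -> 24000 samples
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=40,
                            n_fft=2048, hop_length=512)      # shape (40, 47)
d1 = librosa.feature.delta(mfcc)                             # 40 first-derivative coefficients
d2 = librosa.feature.delta(mfcc, order=2)                    # 40 second-derivative coefficients
features = np.vstack([mfcc, d1, d2]).T                       # 47 frames x 120 features
features = (features - features.mean()) / features.std()     # simple normalization
```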
Then, to extract the CWT features, 99 Morlet wavelets with different frequencies were generated and convolved with the data signals to produce a CWT feature matrix of size 99 × 24000, where 24000 is the number of sample points in a signal. For the STFT approach, the size of an STFT feature matrix was 1025 × 47, where 1025 represents the different frequencies and 47 corresponds to the 47 window frames; the entries of the STFT feature matrices are the decibel values of the corresponding frequency in each time window frame. The CWT and STFT feature matrices were directly fed into the four ML algorithms (i.e., RF, KNN, SVM, and HMM). For the CNN-enhanced algorithms, the CWT and STFT feature matrices, together with the original data signals of the four different weld statuses, were converted to contour images, which are illustrated in Fig.7. The image conversion exploits CNNs’ ability to capture spatial relationships and extract meaningful features from visual data.
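Similarly, the STFT and CWT feature matrices can be sketched as follows, assuming librosa and PyWavelets; the scale range for the 99 Morlet wavelets is an assumption, since the exact wavelet frequencies used by the authors are not reported.

```python
import librosa
import numpy as np
import pywt

y, _ = librosa.load("record_001.wav", sr=48000)              # 24000-sample clip

# STFT: 1025 frequency bins x 47 window frames, converted to decibel values
stft = librosa.stft(y, n_fft=2048, hop_length=512)           # complex, shape (1025, 47)
stft_db = librosa.amplitude_to_db(np.abs(stft))

# CWT: 99 Morlet wavelets convolved with the signal -> shape (99, 24000)
scales = np.arange(1, 100)                                   # 99 scales (assumed values)
cwt_coeffs, _ = pywt.cwt(y, scales, "morl")
```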
Therefore, combining the above information, the three methods ranked by the number of features per signal, from smallest to largest, are MFCC, STFT, and CWT.
3.3 Results and verification
Five-fold cross validation was used to verify the reliability and generalization of the AWDC methodology. The concept of 5-fold cross validation, as shown in Fig.2, is to divide the total data set into 5 groups of identical size. Specifically, in this paper, 20% of the signals of each type of weld status were randomly selected to form a single group (i.e., stratified sampling). Then, five validations were performed; for each validation, one of the groups was selected as the testing set and the rest were used as the training set. The study divided the original data set (OD) of 1897 featured data signals into three groups of 379 signals and two groups of 380 signals; correspondingly, the training sets included either 1518 or 1517 signals. Similarly, each single group of the expanded data set (ED) included the featured data of 3794 signals; therefore, the testing set of the ED included 3794 signals, while the training set included the featured data of 15176 signals.
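The stratified 5-fold split can be reproduced with standard tooling. The sketch below assumes scikit-learn; the arrays X and y are dummy placeholders standing in for the extracted features and the weld-status labels.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Dummy placeholders: in the study, X held 1897 (OD) or 18970 (ED) featured signals.
X = np.random.rand(1897, 120)
y = np.random.choice(["Q", "LF", "P", "SI"], size=1897)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for fold, (train_idx, test_idx) in enumerate(skf.split(X, y)):
    X_train, X_test = X[train_idx], X[test_idx]   # about 1518/379 signals per fold (OD)
    y_train, y_test = y[train_idx], y[test_idx]
    # train the classifier on (X_train, y_train) and evaluate it on (X_test, y_test)
```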
The classifiers used in this paper include 4 ML algorithms (i.e., RF, KNN, SVM, and HMM). The key hyperparameters of these algorithms were adjusted to strike a balance between accuracy and efficiency. Specifically, the number of estimators was set to 100 for RF. The K value of KNN was selected over the range from 1 to 50 based on the accuracy of the classifier. The kernel of the SVM was the radial basis function, and the decision function shape was “one versus one (OVO)” (as explained in Appendix B in the Electronic Supplementary Material). The range of the number of hidden states in the HMM was set from 2 to 24, and the iteration number was 1000.
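A sketch of these configurations is given below, assuming scikit-learn and reusing the X_train/y_train placeholders from the cross-validation sketch above; the K search is shown with cross-validated accuracy on the training set, and the HMM (which requires one model per class) is omitted for brevity.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

rf = RandomForestClassifier(n_estimators=100)               # 100 estimators
svm = SVC(kernel="rbf", decision_function_shape="ovo")      # RBF kernel, one-versus-one

# Select K over 1..50 by cross-validated accuracy on the training set
best_k = max(range(1, 51),
             key=lambda k: cross_val_score(KNeighborsClassifier(n_neighbors=k),
                                           X_train, y_train, cv=5).mean())
knn = KNeighborsClassifier(n_neighbors=best_k)
```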
Nonetheless, the accuracies of these ML algorithms are limited by their nature. To tackle this problem, CNN-based processing of the extracted features can potentially lead to more accurate results [34,35]. The architecture of a typical CNN is explained in Appendix C in the Electronic Supplementary Material. Consequently, in this study, the CNN was used to enhance RF, KNN, and SVM. The HMM algorithm was not enhanced, as its inputs are 2-dimensional non-image features.
The traditional ML algorithms receive stacked data, i.e., a single dimension of featured data excluding the batch-size dimension. In contrast, the CNN-enhanced algorithms (CNN + ML) receive image-like data with three feature dimensions, excluding the batch-size dimension. The HMM, which is suitable for handling sequential data, receives two feature dimensions, excluding the batch-size dimension. The input shape of each algorithm is listed in Tab.2.
The architecture of the CNN used in this paper is shown in Fig.8. Specifically, to test the concept, the operation was divided into route 1 and route 2. Route 1 was the regular approach, in which the featured data were directly fed into the traditional ML algorithms. Route 2 comprised the CNN-enhanced versions of RF, KNN, and SVM. The CNN had 4 convolutional layers, 4 batch normalization layers, 4 activation (i.e., rectified linear unit, ReLU [36]) layers, 4 max pooling (MP) layers, and an adaptive average pooling (AAP) layer before the pooled map. Since the filters were trained by backpropagation, the CNN could automatically find the most suitable features for classification. The four-layer convolutional configuration was designed by trial and error, as the network was not too deep yet extracted adequate features; the classification accuracies of two- or three-layer configurations were not as high as that of the four-layer configuration. This is consistent with findings on the impact of network depth on performance [37,38]. The output shapes of the trainable layers and the hyperparameter configurations of each layer are listed in Tab.3 and Tab.4.
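A sketch of this four-block CNN is given below in PyTorch (the paper does not state the framework); the channel counts, kernel sizes, and pooled-map size are illustrative assumptions, since the exact values are listed in Tab.3 and Tab.4.

```python
import torch
import torch.nn as nn

class WeldCNN(nn.Module):
    """Four conv/BN/ReLU/MP blocks, then adaptive average pooling (route 2)."""
    def __init__(self, in_channels=3, n_classes=4):
        super().__init__()
        blocks, c_in = [], in_channels
        for c_out in (16, 32, 64, 128):                       # 4 convolutional blocks
            blocks += [nn.Conv2d(c_in, c_out, kernel_size=3, padding=1),
                       nn.BatchNorm2d(c_out),
                       nn.ReLU(),
                       nn.MaxPool2d(2)]
            c_in = c_out
        self.features = nn.Sequential(*blocks)
        self.pool = nn.AdaptiveAvgPool2d((4, 4))              # AAP layer -> pooled map
        self.classifier = nn.Linear(128 * 4 * 4, n_classes)   # used during pretraining

    def forward(self, x):
        x = self.pool(self.features(x))
        flat = torch.flatten(x, 1)       # flattened features later fed to RF/KNN/SVM
        return self.classifier(flat), flat
```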
4 Evaluated results and discussion
After the experimental setup and the data set preprocessing, this section presents the results, followed by conclusions and brief discussions. A data set of 1897 signals was collected from the 21 different specimens described in Subsection 3.1. Subsequently, the classification accuracies and the computational efforts of the aforementioned feature extraction methods, ML algorithms, and CNN-enhanced algorithms were compared, including results for the ED.
4.1 Results of classic machine learning algorithms
Fig.9 and Fig.10 show the average accuracies of the different classifiers for the different features of the OD and the ED, respectively; the average of the 5-fold accuracies is marked in the figures. In this experiment, the relative merits of the feature extraction methods (MFCC, CWT, and STFT) were evaluated along with those of the classifiers. The RF, KNN, and SVM with STFT features achieved accuracies higher than 92% for both the OD and the ED. The HMM with STFT features, on the other hand, was 88.4% and 86.2% accurate for the OD and the ED, respectively.
Fig.11 illustrates the average identification accuracy (in %) and the computational effort of each feature extraction method with different classifiers. The computer used in this study had 2 Intel Xeon E5-2696 v4 CPUs (2.2 GHz), an NVIDIA GTX 1660 Super (14 GB) GPU, and 256 GB RAM. The ML algorithms were trained on the CPU, while the CNN-enhanced algorithms were trained on the GPU. As Fig.11 shows, the STFT extracted the most suitable features for the weld defect identification problem in this case, a conclusion that also matches those from Fig.9 and Fig.10.
Furthermore, Fig.12 shows the confusion matrices of RF, KNN, and SVM with the STFT features of the OD. The diagonal values within these matrices (marked in Fig.12(a)) indicate high accuracies of these algorithms, whereas the off-diagonal values (marked in Fig.12(b)) indicate larger errors. For example, as marked in Fig.12(e), 8.82% of the Q welds were classified as LF defects in this group.
It can be seen that, even for the worst-performing of the five groups, the RF classifier still demonstrated high accuracies for the Q (99%), LF (100%), and P (98%) weld statuses, while the accuracy for the SI defects was lower, near 92.5%. Generally, the KNN demonstrated subpar performance relative to the other algorithms, whereas the SVM was more reliable in classifying all four types of weld statuses.
Finally, the computational efforts (including both the training time and the testing time) were evaluated, as shown in Fig.13. RF, KNN, and SVM required low computational efforts for both the OD and the ED. The training efforts of the CNN-enhanced ML algorithms were noticeably larger than those of the traditional ML algorithms (i.e., 10 (OD) or 4 (ED) times the computational effort per signal on average). Comparatively, the testing efforts of the CNN-enhanced ML algorithms were also larger (i.e., 8 (OD) or 2 (ED) times the computational effort per signal on average), although the differences were not as significant as those of the training process. Only the HMM-based approaches required relatively more computational effort; most of the algorithms required only a few hours of computational time. Therefore, it can be concluded that the increased computational effort of the enhanced approaches is acceptable.
4.2 Results of convolutional neural network-enhanced machine learning methods
The pretraining followed the typical training process of a CNN, in which the hyperparameters were adjusted to achieve higher training accuracy. Once the pretraining was finished, the testing sets were used as the inputs to the CNN. Then, the flattened features (as shown in Fig.8) were obtained from the AAP layer and used as the inputs to the 3 ML algorithms. The accuracies of the different classifiers for the different features of the OD and the ED are listed in Fig.9 and Fig.10. It is evident that the accuracies of the CNN-enhanced algorithms were improved relative to those of the traditional ML algorithms.
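This route-2 pipeline can be sketched as follows, reusing the illustrative WeldCNN class from the earlier sketch; `model`, `train_images`, `test_images`, and the label arrays are assumed to come from the pretraining stage and are not variables defined in the paper.

```python
import torch
from sklearn.svm import SVC

model.eval()                                   # pretrained WeldCNN (route 2)
with torch.no_grad():
    _, train_feats = model(train_images)       # flattened features from the AAP layer
    _, test_feats = model(test_images)

svm = SVC(kernel="rbf", decision_function_shape="ovo")
svm.fit(train_feats.cpu().numpy(), train_labels)
accuracy = svm.score(test_feats.cpu().numpy(), test_labels)
```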
In addition, Fig.14 shows the confusion matrices of the CNN-enhanced SVM with the STFT features of the ED. It can be concluded that the CNN-enhanced ML algorithms demonstrated excellent performance for all weld statuses (mostly 100% accuracy, with 99% accuracy in the case of the SI defects). Compared with the regular approaches (i.e., route 1 in Fig.8), the accuracy of the CNN-enhanced ML algorithms was improved by up to 12.75%.
Some of the training processes are shown in Fig.15. It can be noticed from the figures that the number of iterations required for the CNN to converge varied for different data sets; generally, the larger the data set, the faster the training process converged.
In addition, the average accuracy with different features for different data sets is shown in Tab.5, while Tab.6 shows the average accuracy of the different classifiers.
5 Conclusions and limitations
This paper proposes the AWDC methodology to intelligently classify four types of weld statuses by processing a feature data set based on percussion sounds. Four traditional ML classifiers and three CNN-enhanced classifiers were investigated in the study. The following conclusions can be drawn based on the results.
1) The AWDC was proved to be valid and effective in classifying the four different types of steel weld statuses. The percussion-based approach combined with ML algorithms requires far fewer instruments and much less human labor.
2) The experimental results also proved that the AWDC methodology can identify both subsurface (e.g., LF) and surface (e.g., P and SI) weld defects, which is an advantage over image-based approaches.
3) Different feature extraction methods and different classification methods lead to different identification performances. The CNN-enhanced SVM classifier outperformed the others for the steel weld identification problems in this study, while RF and SVM were also efficient approaches that struck a balance between accuracy and computational effort. The number of STFT features per signal was smaller than that of CWT but larger than that of MFCC. Among the three feature extraction methods (i.e., MFCC, CWT, and STFT), the features extracted by STFT corresponded to higher accuracies, near 99%, when classifying the steel welds.
4) HMM, which is generally considered suitable for time-series modeling problems, was also used as a classifier in the study. The results indicated that, in this specific case, the accuracy (about 80.3% on average) and the efficiency (about 200 h on average for training and testing combined) of the HMM were not the best among the investigated algorithms. Nonetheless, for other problems, the performance of the HMM approach remains to be evaluated, and HMM could still serve as a strong candidate for sound-based classification problems.
Furthermore, there are some limitations in this study.
1) During the percussion process, the cylinder handle, the base steel material, and many other factors may also affect the characteristics of the percussive sound. Meanwhile, different sizes of the cylinder handle and different percussion tools may also lead to different effects.
2) The specimens were manufactured specifically for the experiment; all the weld defects were deliberately added rather than collected from real projects.
3) Only two kinds of joint shape (L and T) were considered in the experiment, while there are many other joint shapes, such as the butt joint, lap joint, and cruciform joint. Whether the proposed method applies to all these weld joints and combinations of them remains to be explored.
4) The accuracies and computational efforts of the HMM and the CNN-enhanced classifiers are highly dependent on their hyperparameters, the tuning of which could be further improved.
The following targets are therefore desirable for the next stage.
1) Attempts to use different percussion tools, such as gavels.
2) Collection of more data from in-service engineering structures.
3) Use of different kinds of wavelets to obtain the CWT coefficients prior to classification.
4) Comparison of the sensitivities of different hyperparameters in terms of accuracy and computational effort.