1 Introduction
Laser-induced breakdown spectroscopy (LIBS) is a powerful atomic emission spectroscopy technique. Since its first report in 1962, LIBS has been regarded as a future superstar of chemical analysis owing to its unique advantages, including simple sample pretreatment, fast detection, minimal sample damage, and few restrictions on the sample and its environment. With the advent of high-performance lasers and spectrometers, the analytical performance of LIBS has improved significantly, and the technique is now widely used in fields such as industry, agriculture, biology, medicine, and space exploration. However, the laser-induced plasma is spatially inhomogeneous and varies drastically in time because of the intense interaction among the laser, sample, plasma, and surrounding gases [1, 2]. As a result, the quantitative results of physical principle-based models are mostly unsatisfactory, because these models use only one or a few spectral line intensities and cannot compensate for matrix effects or signal fluctuations. Effectively improving the quantitative performance of LIBS therefore remains a critical challenge for all LIBS researchers.
The LIBS spectrum is a complex spectrum composed of ionic, atomic, and molecular lines, which carry information from both the sample and the ambient gases. Meanwhile, it is also influenced by factors such as intensity fluctuations, matrix effects, self-absorption, and spectral line interference [3]. Therefore, signal acquisition and data processing methods have become important directions of LIBS research. Quantitative methods built on manually extracted spectral features rely on personal experience, which is subjective and one-sided, making it difficult to fully exploit the useful information in spectral data. Machine learning, an emerging interdisciplinary approach, can extract useful information from LIBS data to the maximum extent with minimal need for subjective interpretation [
4]. Various machine learning methods have indeed emerged in LIBS in recent years, and it is necessary to summarize this progress and to look forward to future developments through a review. Zhang
et al. [
5,
6] summarized, a few years ago, the research progress of chemometric methods in LIBS in terms of spectral data pre-processing and qualitative and quantitative analysis. Another review surveyed the state and progress of machine learning applications in LIBS and pointed out that problems and challenges such as over-fitting, under-fitting, and spectral noise still need to be overcome [
7]. The above reviews give a good summary of current applications of machine learning algorithms in LIBS, but they do not focus on the key problems that machine learning must address in LIBS quantitative analysis, such as matrix effects and signal uncertainty. Wang
et al. [
1] provided a guideline for LIBS researchers with the basic knowledge needed for further quantification improvement, including the mechanisms of LIBS uncertainty generation, plasma modulation methods, and quantification methods. They noted that machine learning algorithms based on physical principles should be introduced into LIBS quantification and could be the ultimate solution when large amounts of data are available.
In this review, recent progress in machine learning for LIBS is comprehensively summarized and discussed in terms of methodology and applications to data preprocessing and to qualitative and quantitative analysis. In particular, the role of machine learning algorithms in improving analysis repeatability and suppressing matrix effects is emphasized. Furthermore, application prospects and suggestions for machine learning in LIBS are proposed.
2 Machine learning methods
Machine learning is a subfield of artificial intelligence (AI) that uses algorithms trained on data sets to create self-learning models that can predict outcomes and classify information. According to the learning approaches, machine learning can be separated into unsupervised learning, supervised learning, and semi-supervised learning, among which supervised learning builds a model based on labeled data, unsupervised learning is based on unlabeled data, and semi-supervised learning is based on a mix of labeled and unlabeled data.
A comparison of supervised, unsupervised, and semi-supervised learning is listed in Tab.1. To present a simple guideline for readers, these three types of machine learning were briefly introduced below, including algorithm principles, functions, advantages, and disadvantages.
2.1 Unsupervised learning algorithms
In unsupervised pattern recognition, the distance between similar compounds in multidimensional space is small, whereas that between different compounds is large, which enables the analysis and clustering of unlabeled data sets. The most common unsupervised learning methods used in LIBS are K-means [
8−
10], principal component analysis (PCA) [
11], hierarchical clustering [
12], and iterative self-organizing data analysis technique (ISODATA) [
12]. K-means is computationally efficient and can handle large, high-dimensional datasets. However, the algorithm is sensitive to the initial selection of centers and can converge to a suboptimal solution; it is also sensitive to outliers, which can have a significant impact on the resulting clusters. The ISODATA algorithm is a modification of K-means clustering: the clustering process begins with an arbitrary cluster average and is not limited by the initial center selection. PCA reduces the dimensionality of the original data set by generalizing the original variance, transforming the original high-dimensional space into a smaller set of independent variables called principal components (PCs). Hierarchical clustering is simple and easy to use, but its high time complexity leads to insufficient performance on large-scale datasets. These algorithms are “unsupervised” because they discover hidden correlations in the data without human intervention. Unsupervised learning models are used in LIBS for three main tasks: clustering, association, and dimensionality reduction.
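As a concrete illustration of how such unsupervised methods are typically applied to LIBS spectra, the following minimal Python sketch (using scikit-learn) combines PCA for dimensionality reduction with K-means clustering; the array sizes and the synthetic data are purely illustrative and do not correspond to any study cited here.

# Minimal sketch: clustering unlabeled "LIBS spectra" with PCA + K-means.
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

# Hypothetical data: 200 spectra x 10000 wavelength channels
spectra = np.random.rand(200, 10000)

# Standardize each channel, then reduce dimensionality with PCA
scores = PCA(n_components=10).fit_transform(StandardScaler().fit_transform(spectra))

# Cluster the PCA scores; n_init > 1 mitigates sensitivity to initial centers
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(scores)
print(labels[:20])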
2.2 Supervised learning algorithms
The basic idea of supervised pattern recognition is that samples with known classes are used as a training set to construct a model, and the class or grade of unknown samples can then be predicted by the trained model. Common supervised pattern recognition methods include multiple linear regression (MLR) [
13−
15], partial least squares (PLS) [
16,
17] , soft independent modeling class analog (SIMCA) [
18], K-nearest neighbor method (KNN) [
19−
21], support vector machines (SVM) [
22−
26], artificial neural networks (ANN) [
27−
31], random forest (RF) [
32−
36], kernel extreme learning machine (KELM) [
37−
39], and linear discriminant analysis (LDA) [
40−
42].
Each algorithm has its own characteristics, and in practical applications it is necessary to choose the appropriate algorithm and avoid its shortcomings. The MLR method is fast, simple, and easy to implement, especially for small data sets and simple relationships; however, if the data follow complex curves or the features are not independent, MLR cannot be used to build a model. PLS is a multivariate statistical method used to find the underlying relationship between spectral intensities and elemental contents (or clustering labels), combining features of PCA and multiple regression. PLS has been widely used in LIBS data processing and efficiently handles high dimensionality and collinearity; however, PLS classification or regression models tend to over-fit when the number of training samples is small. SIMCA is a pattern recognition method based on PCA; its limitation is that a non-optimized discrimination model is generated when the differences between classes are close to the differences within a class. The KNN algorithm uses proximity to classify or predict the grouping of an individual data point. KNN makes the most direct use of the relationships between samples, reducing the adverse effect of improper class-feature selection on classification results and minimizing classification errors as far as possible; it is not sensitive to outliers but has high computational and spatial complexity. SVM is a linear method in a very high-dimensional feature space that is nonlinearly related to the input space; however, it does not achieve satisfactory results when trained on high-dimensional data because of the huge memory consumption and long computation time. ANN is a processing system that imitates the structure and function of biological neural networks and has advantages in self-learning and in handling nonlinear relationships. ANNs also have disadvantages, such as high demands on computing hardware, susceptibility to overfitting, and difficulty in interpreting the decision-making process. The ease of use and flexibility of RF have fueled its adoption in LIBS, as it handles both classification and regression problems. Compared with other methods, RF offers higher accuracy and speed, can balance error across the data and evaluate the importance of each variable, and in particular avoids over-fitting problems. KELM is an improved algorithm that combines the extreme learning machine (ELM) with kernel functions: a stable kernel function replaces the random feature space in ELM, which yields fast learning, better stability, and good generalization performance.
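To make the comparison of such supervised methods concrete, the following minimal Python sketch cross-validates a PLS regressor and an SVR on synthetic spectra with scikit-learn; the data, model settings, and resulting scores are illustrative assumptions, not results from the cited studies.

# Minimal sketch: comparing two supervised regressors on synthetic "spectra".
import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.svm import SVR
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.random((100, 500))                       # 100 spectra x 500 channels (synthetic)
y = X[:, 50] * 10 + rng.normal(0, 0.1, 100)      # concentration tied to one "line"

models = {
    "PLS": PLSRegression(n_components=5),
    "SVR": make_pipeline(StandardScaler(), SVR(C=10, epsilon=0.01)),
}
for name, model in models.items():
    rmse = -cross_val_score(model, X, y, cv=5,
                            scoring="neg_root_mean_squared_error").mean()
    print(f"{name}: cross-validated RMSE = {rmse:.3f}")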
2.3 Semi-supervised learning algorithms
Semi-supervised learning serves as a bridge between supervised and unsupervised machine learning: the algorithm uses a small portion of labeled data together with a large amount of unlabeled data, from which a model must learn and make predictions on new examples. Typically, similar data are first clustered with an unsupervised algorithm, and the existing labeled data are then used to label the remaining unlabeled data. Semi-supervised learning is particularly useful when a large amount of unlabeled data is available but labeling all of it would be too expensive or difficult. Semi-supervised learning has already been used in LIBS for quantification and identification [
43,
44]. Li
et al. [
43] proposed a novel semi-supervised LIBS quantitative analysis method for high-alloy steel samples based on a co-training regression model with selection of effective unlabeled samples. Only when effective unlabeled samples are chosen for model training can the prediction accuracy and generalization ability of the model be effectively improved. Wang
et al. [
92] used semi-supervised learning combined with LIBS for explosive identification; compared with KNN, PCA, and SIMCA, the semi-supervised algorithm produced better results. Semi-supervised algorithms therefore have good application prospects for sample classification, using the labeled data to guide the learning procedure.
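A minimal sketch of the semi-supervised idea described above is given below, using scikit-learn's SelfTrainingClassifier with an SVM base learner; the synthetic data and the 10% labeling fraction are assumptions made only for illustration.

# Minimal sketch: semi-supervised classification with a small labeled subset.
import numpy as np
from sklearn.svm import SVC
from sklearn.semi_supervised import SelfTrainingClassifier

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (100, 50)), rng.normal(3, 1, (100, 50))])
y = np.array([0] * 100 + [1] * 100)

# Pretend only ~10% of the spectra are labeled; the rest get the label -1
y_semi = y.copy()
unlabeled = rng.random(200) > 0.1
y_semi[unlabeled] = -1

base = SVC(probability=True, gamma="scale")
model = SelfTrainingClassifier(base).fit(X, y_semi)
print("accuracy on all samples:", model.score(X, y))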
3 Data preprocessing with machine learning
LIBS spectral data are highly complex and redundant; spurious data and redundant information in LIBS spectra degrade the performance of classification or prediction models. Data preprocessing can mine useful and latent features from raw LIBS data, improving analytical performance by eliminating redundant and irrelevant features. Fig.1 shows the role of machine learning in data preprocessing for LIBS. Machine learning-based data processing in LIBS includes denoising, interference removal, feature extraction, and variable reconstruction.
3.1 Denoising and interference removal
Noise and interference in LIBS have an important impact on the stability and reliability of spectral data, and accurate quantitative LIBS analysis is always hampered by them. Machine learning algorithms have been used to address these problems, effectively improving the quantitative performance and broadening the applications of LIBS.
Noise is an unavoidable and significant component of the LIBS signal, and different sources of noise seriously influence the prediction performance of the analysis model. To avoid using noise information as a feature, it is necessary to develop effective methods to filter noise [
45−
47]. Wavelet threshold denoising (WTD) is the most commonly used noise-filtering method; it localizes features in LIBS data at different scales and preserves important signal components while removing noise [
48,
49]. In WTD, a threshold with an adjustment factor can be applied to overcome over-smoothing and improve model performance. As shown in Fig.2, wavelet analysis procedures can eliminate environmental and background noise in LIBS spectra [
50]. Based on entropy analysis of noisy LIBS signal and noise, Zhang
et al. [
51] presented a method for selecting the optimal decomposition level, which reduced the limit of detection values by more than 50%. Before constructing the analytical model, WTD and Kalman filtering were used to preprocess coal ash spectra, and WTD showed better performance in filtering noise [
52]. Yang
et al. [
53] proposed an empirical mode decomposition (EMD) approach based on the wavelet method to remove LIBS noise; compared with other denoising methods, it showed good denoising metrics and stability.
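The following minimal Python sketch illustrates wavelet threshold denoising of a one-dimensional spectrum with PyWavelets; the wavelet, decomposition level, and threshold rule (a universal threshold scaled by an adjustment factor) are illustrative choices rather than the settings used in the cited works.

# Minimal sketch: wavelet threshold denoising (WTD) of a 1-D spectrum.
import numpy as np
import pywt

def wtd_denoise(spectrum, wavelet="db8", level=4, k=1.0):
    coeffs = pywt.wavedec(spectrum, wavelet, level=level)
    # Universal threshold scaled by an adjustment factor k, with the noise
    # level estimated from the finest detail coefficients (a common choice)
    sigma = np.median(np.abs(coeffs[-1])) / 0.6745
    thr = k * sigma * np.sqrt(2 * np.log(len(spectrum)))
    coeffs[1:] = [pywt.threshold(c, thr, mode="soft") for c in coeffs[1:]]
    return pywt.waverec(coeffs, wavelet)[: len(spectrum)]

noisy = np.sin(np.linspace(0, 20, 2048)) + np.random.normal(0, 0.3, 2048)
clean = wtd_denoise(noisy)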
According to the data characteristics, some researchers have improved the traditional noise filtering methods. Xie
et al. [
45] proposed an improved soft-hard trade-off threshold method, in which the RSD of each trace element was significantly improved after denoising. Duan
et al. [
54] proposed an improved wavelet double-threshold function, which improved both the spectral denoising and the performance of the Cu and Zn models. As noted above, Zhang et al. [51] selected the optimal decomposition level based on entropy analysis of the noisy LIBS signal and the noise; experimental data analysis showed that this method can reduce the fluctuation of noisy signals and improve the signal-to-noise ratio of LIBS.
Line broadening due to plasma processes or the instrument degrades spectral resolution, leading to uncertainty in the elemental profile obtained by optical emission spectroscopy [
55]. In the research of using machine learning algorithms to solve spectral line interferences, Liu
et al. [
56] developed an algorithm based on iterative discrete wavelet transform (IDWT) and Richardson-Lucy deconvolution (RLD) to reduce the impact of spectral interference and improve the accuracy of quantitative analysis. Tan
et al. [
57] studied an error compensation method based on the curve fitting method to realize the decomposition and correction of overlapping peaks. This improved the efficiency of quantitative analysis in LIBS by greatly reducing fitting residuals. Wang
et al. [
58] applied the Fourier self-deconvolution method to analyze overlapping peaks. This method effectively improved the detection of Pb concentration in polluted water by LIBS.
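As an illustration of the deconvolution approaches mentioned above, the following minimal Python sketch implements Richardson-Lucy deconvolution for overlapping peaks, assuming a known (here Gaussian) instrumental broadening function; the spectrum, broadening function, and iteration count are synthetic assumptions.

# Minimal sketch: Richardson-Lucy deconvolution of overlapping peaks.
import numpy as np
from scipy.signal import fftconvolve

def richardson_lucy(observed, psf, iterations=50):
    psf = psf / psf.sum()
    psf_mirror = psf[::-1]
    estimate = np.full_like(observed, observed.mean())
    for _ in range(iterations):
        blurred = fftconvolve(estimate, psf, mode="same")
        ratio = observed / np.maximum(blurred, 1e-12)
        estimate *= fftconvolve(ratio, psf_mirror, mode="same")
    return estimate

x = np.linspace(-5, 5, 1001)
psf = np.exp(-x**2 / (2 * 0.1**2))                         # instrumental profile
true_spec = np.exp(-(x - 0.2)**2 / 0.002) + np.exp(-(x + 0.2)**2 / 0.002)
observed = fftconvolve(true_spec, psf / psf.sum(), mode="same")
restored = richardson_lucy(observed, psf)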
LIBS is strongly influenced by signal fluctuations, which correlate with plasma properties, measurement conditions, and the physical properties of the samples. To reduce fluctuations of spectral intensity and increase the predictive ability of quantitative models, LIBS spectra often need to be normalized. Data standardization or normalization corrects the LIBS signal by dividing it by a factor, such as the spectral background, total area, an internal standard, the standard normal variate, plasma characteristic parameters, the acoustic signal induced by the shock wave, or the ablated mass [
59]. LIBS combined with data standardization based on machine learning methods has been reported. Wang
et al. [
60] designed a back-propagation neural network (BPNN) model for standardizing the spectrum to a lower relative standard deviation (RSD) of emission line intensities, with the training spectra, sample energy, and image parameters as inputs. This data processing method not only provides practical access to stable spectral information for both qualitative and quantitative LIBS analysis but also shows a bright future for combining LIBS data processing with machine learning methods.
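For reference, the following minimal Python sketch implements three of the normalization schemes listed above (total area, internal standard, and standard normal variate); the synthetic spectra and the internal-standard channel index are hypothetical.

# Minimal sketch: common LIBS spectrum normalization schemes.
import numpy as np

def normalize_total_area(spectra):
    return spectra / spectra.sum(axis=1, keepdims=True)

def normalize_internal_standard(spectra, ref_channel):
    # Divide each spectrum by the intensity at a reference (internal standard) channel
    return spectra / spectra[:, [ref_channel]]

def normalize_snv(spectra):
    mean = spectra.mean(axis=1, keepdims=True)
    std = spectra.std(axis=1, keepdims=True)
    return (spectra - mean) / std

spectra = np.random.rand(50, 3000) + 0.1          # 50 synthetic spectra
area_norm = normalize_total_area(spectra)
is_norm = normalize_internal_standard(spectra, ref_channel=1500)
snv_norm = normalize_snv(spectra)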
3.2 Spectral data selection
The LIBS spectrum is a hybrid spectrum containing lines from ions, atoms, and molecules of the elements in both the sample and the ambient gas. Furthermore, owing to plasma instability, operational errors, instrument abnormalities, or parameter changes, the collected LIBS spectra may not contain complete or correct plasma emissions; such spectra are defined as spurious or abnormal data. A limited set of characteristic lines may not provide sufficient information for quantitative models, while using the full spectrum introduces a large amount of redundant data. To improve the robustness of the analytical model, machine learning algorithms are used to extract characteristic spectral lines strongly correlated with element content and to avoid spurious or abnormal data.
Some researchers have used machine learning methods to identify and reject abnormal spectra. Combining prior knowledge with input selection algorithms is a popular way to reduce the influence of redundant data on the analytical results and to decrease model complexity. Lu
et al. [
61] proposed a hybrid feature selection method combined with the wavelet transform to analyze the calorific value of coal using LIBS. The results showed that the feature selection method can effectively reduce the spectral dimension, remove irrelevant information, and select the relevant spectral data. As shown in Fig.3, not only characteristic lines but also some molecular lines and background contribute to the quantitative model.
Feature extraction improves LIBS analytical performance by screening important wavelength variables and eliminating the effect of plasma uncertainty on the spectrum. Therefore, feature extraction algorithms have been used in combination with qualitative identification and quantitative analysis models to improve the analytical performance of LIBS. Xie
et al. [
62] used wavelet packet transform to select effective feature spectral lines and combined them with relevant vector machines to achieve accurate in situ component prediction. Chen
et al. [
63] proposed a weakly supervised method called spectral distance variable selection (SDVS), which utilizes prior information of samples to evaluate spectral feature weights. Compared with full-spectrum input and other feature selection methods, this method substantially improves prediction accuracy. Harefa
et al. [
64] used sequential forward selection (SFS) to eliminate the most irrelevant features, which required short computation time but improved the classification accuracy of four machine learning models (quadratic discriminant analysis, RF, Bernoulli naïve Bayes, and SVM). Chu
et al. [
65] proposed an approach using LIBS combined with the ensemble learning based on the random subspace method (RSM), which extracts important spectral lines (Na, K, Mg, Ca, H, O, N, C−N) from the LIBS spectrum of blood cancer samples, and the recognition ability of blood cancer types can be greatly improved. Therefore, a feature extraction algorithm can be used to judge the importance of each variable and maintain the most important variables [
66]. Kong
et al. [
67] proposed an automatic method to select analytical and reference lines for internal standard method from the original spectra based on GA. The featured optimal analytical and reference lines can effectively improve the quantitative accuracy. Gan
et al. [
68] used the uninformed variable elimination (UVE) algorithm to remove the non-information noise variables, and then the competitive adaptive reweighted sampling (CARS) algorithm was used to screen the important wavelength variables related to Pythium, the extracted feature lines can effectively improve the accuracy of PLS model. Ma
et al. [
69] applied GA to screen 12 wavelength variables related to the characteristic spectral lines of Ca, Na, and K elements in manure samples, and the method could significantly reduce the modeling variable information and improve the prediction accuracy of Ca content in manure using LIBS. He
et al. [
70] proposed a hybrid variable selection method mutual information-particle swarm optimization (MI-PSO) to realize precise screening of LIBS and Fourier transform infrared spectrometer (FTIR) spectral characteristic variables of coal samples. The MI was used to eliminate redundant variables in the spectral data, and the PSO was used to further filter the retained variables to find a set of variables with higher prediction accuracy. The algorithm mentioned above can more accurately predict the ash content and volatile matter of coal quality analysis. Duan
et al. [
71,
72] proposed an automatic variable selection method for quantitative analysis of soil samples using LIBS based on full spectrum correction (FSC) and modified iterative predictive weighted partial least squares (mIPW-PLS), which selects features automatically without manual intervention. To illustrate its feasibility and effectiveness, a comparison with GA and the successive projections algorithm (SPA) was carried out for the detection of different elements (copper, barium, and chromium) in soil; the method requires little computation time and improves the prediction performance of the quantification models. Recursive feature elimination can effectively reduce redundant variables and prevent overfitting. Lu
et al. [
73] extracted effective features from the de-noised LIBS spectrum using the recursive feature elimination with cross-validation (RFECV) method. According to the selected features, SVR model of coal was established. The performance of the models was significantly improved compared with the original model. Wang
et al. [
74] used an RFE method based on ridge for feature selection. The results showed that the root-mean-square error prediction (RMSEP) was significantly reduced compared with the PLS model with full spectrum as input. Ruan
et al. [
75] proposed to combine sequence reverse selection with RF for quantitative analysis of phosphorus and sulfur in steel, and the results showed that the RF model based on sequential backward selection (SBS) had a better prediction effect than the univariate method, the PLS model and the traditional RF model. Ruan
et al. [
76] also proposed an improved backward elimination feature selection method. Compared with the predicted results of the RF, VI-RF, and SBS-RF models, the improved SBS-RF model has higher sensitivity, specificity, and accuracy. To improve the accuracy of the model, Ding
et al. [
77] proposed mean decrease accuracy (MDA) and mean decrease impurity (MDI) feature selection methods based on RF to filter the LIBS data. Four models (MDA-CNN, MDA-RF, MDI-CNN, and MDI-RF) were constructed using convolutional neural networks (CNN) and RF and applied to predict soil sources. The experimental results indicate that this analysis method can effectively determine the soil source. You
et al. [
78] used RF for variable selection to reduce the number of characteristic variables from 100 to 6, which significantly reduced the interference of irrelevant spectral lines. Lv
et al. [
79] proposed a feature extraction method that combines the linear regression (LR) and the sparse and under-complete autoencoder (SUAC) neural network. This method performs nonlinear feature extraction and dimension reduction on high-dimensional spectral data.
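In the same spirit as the recursive feature elimination studies above, the following minimal Python sketch uses scikit-learn's RFECV with a ridge regressor to select wavelength channels by cross-validation; the synthetic data, estimator, and step size are illustrative assumptions.

# Minimal sketch: recursive feature elimination with cross-validation (RFECV).
import numpy as np
from sklearn.feature_selection import RFECV
from sklearn.linear_model import Ridge

rng = np.random.default_rng(2)
X = rng.random((80, 200))                       # 80 spectra x 200 channels
y = 5 * X[:, 10] + 3 * X[:, 120] + rng.normal(0, 0.05, 80)

selector = RFECV(Ridge(alpha=1.0), step=10, cv=5, min_features_to_select=5)
selector.fit(X, y)
print("selected channels:", np.where(selector.support_)[0])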
In LIBS, matrix effects arise from changes in the emission line intensities of some elements when the physical properties and/or the chemical composition of the sample matrix vary. Matrix effects limit the performance of LIBS in absolute elemental analysis, since the spectral intensity of an emission line at a given concentration depends on the matrix. Many studies have focused on mitigating matrix effects through data preprocessing. Wu
et al. [
80] used the CARS method to select characteristic and related variables of Cr in LIBS spectra of edible vegetable oil and established a calibration model using LSSVM based on the selected variables. The number of variables was reduced from 132 to 10, and CARS-LSSVM reduced the influence of matrix effects on the analytical element and improved the prediction accuracy of LIBS analysis. Zhu
et al. [
81] proposed a multi-spectral line internal calibration method for the quantitative analysis of Pb in irregular lead-brass alloy samples. The linear fit of the calibration curve reached 0.9846, indicating that this method can, to some extent, eliminate the influence of matrix effects and spectral interference and significantly improve measurement accuracy. Long
et al. [
82] proposed a data selection method based on plasma temperature matching (DSPTM) to reduce both matrix effects and signal uncertainty. By selecting spectra with smaller plasma temperature differences across all samples, the univariate and multiple linear regression (MLR) models were made to rely more on spectra with smaller matrix effects and signal uncertainty, thereby improving the final quantification accuracy and precision.
3.3 Variable reconstruction
The purpose of variable reconstruction is to build new variables from extracted data that are sensitive and informative with respect to the concentration of the element to be measured in the LIBS spectrum. In some cases, a new set of derived features provides better interpretability than the original LIBS data. In variable reconstruction using machine learning algorithms, in addition to directly selecting spectral line information of the element to be measured from the LIBS spectrum, spectral line information of non-measured elements is sometimes also extracted, and the selected information is recombined into new variables.
PCA is one of the most widely used variable reconstruction methods; it recombines many correlated variables into a new set of uncorrelated composite variables that replace the original ones. PCA can effectively reduce the data dimension while retaining most of the information in the original LIBS spectrum [
83]. In addition, PCA can also be used for data preprocessing, visualization, dimensionality reduction, model building, classification, quantification and non-conventional multivariate mapping [
84]. Sirven
et al. [
85] introduced PCA into the processing of LIBS data for outliers filtering, they used basic visual thresholding and omitted up to 30% of spectra prior to a rock classification in a preflight ChemCam testing. Abdel-Salam
et al. [
86] utilized PCA to extract features from the LIBS spectra of recent and ancient bovine bone samples, and the first two principal components, which accounted for 90.3% of the total variance, were used to establish the identification model shown in Fig.4. Farhadian
et al. [
87] used the first three components which cover 96% information of LIBS data as the input variables of ANN model, and the accuracy reached 100% for the identification of energetic materials in the Ar atmosphere. Yuan
et al. [
88] extracted 13 principal components as the input variables of SVM model, and the classification accuracy reached 100% for the rapid classification of steel materials.
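A minimal Python sketch of the general workflow used in such studies, feeding leading principal components into an SVM classifier within a cross-validated pipeline, is given below; the synthetic data and the choice of 13 components are illustrative and are not the data of the cited work.

# Minimal sketch: PCA scores as inputs to an SVM classifier.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(3)
X = np.vstack([rng.normal(0, 1, (60, 1000)), rng.normal(0.5, 1, (60, 1000))])
y = np.array([0] * 60 + [1] * 60)

clf = make_pipeline(StandardScaler(), PCA(n_components=13), SVC(kernel="rbf"))
print("cross-validated accuracy:", cross_val_score(clf, X, y, cv=5).mean())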
In addition, more and more new variable reconstruction algorithms have been proposed and used to improve the performance of LIBS quantitative analysis. Li
et al. [
89] used GA to select intensity ratios of spectral lines belonging to the target and matrix elements, and these selected line-intensity ratios were then taken as inputs to an ANN-based analysis model for copper (Cu) and vanadium (V) in steel samples. The results showed that combining GA with ANN markedly improves the prediction accuracy for Cu and V in steel compared with traditional internal calibration methods. Zhong
et al. [
90] introduced the concept of standardized root mean square error of cross-validation (SRMSECV) to select the median area of all spectra of the same sample as the center and discard the spectra outside the spectral area interval. Under the optimized areal screening span, the average of determination coefficients (
R2) and the accuracy of multi-element analysis were improved to some extent. Neural networks are a subset of machine learning, and one of the most impressive ANN architectures is the convolutional neural network (CNN). Dong
et al. [
91] proposed a lightweight CNN model that extracts low-level spectral features through its first three convolutional layers; by exploiting more data dimensions, the model improves the accuracy of quantitative analysis of flowing slurry by addressing problems such as the matrix effect, the self-absorption effect, and limited sample size. In addition, machine learning algorithms have also been used for parameter optimization in LIBS experiments. Prochazka
et al. [
92] developed an ANN algorithm to predict the signal-to-noise ratio (SNR) of selected spectral lines based on specific experimental parameters (laser pulse energy and gate delay) and on the sample’s physical and mechanical properties. It has been concluded that the optimization process can be substituted or significantly shortened by means of the ANN.
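To illustrate what a lightweight spectral CNN of the kind mentioned above might look like, the following minimal PyTorch sketch defines a small one-dimensional convolutional regressor; the architecture, layer sizes, and input length are illustrative assumptions and do not reproduce the cited model.

# Minimal sketch: a lightweight 1-D CNN regressor for spectra.
import torch
import torch.nn as nn

class SpectraCNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv1d(1, 8, kernel_size=7, stride=2), nn.ReLU(),
            nn.Conv1d(8, 16, kernel_size=5, stride=2), nn.ReLU(),
            nn.Conv1d(16, 32, kernel_size=3, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool1d(8),
        )
        self.head = nn.Sequential(nn.Flatten(), nn.Linear(32 * 8, 1))

    def forward(self, x):               # x: (batch, 1, n_channels)
        return self.head(self.features(x))

model = SpectraCNN()
x = torch.rand(4, 1, 2048)              # 4 synthetic spectra
print(model(x).shape)                   # torch.Size([4, 1])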
In summary, to further improve the accuracy and precision of machine learning-based models, various data preprocessing steps are applied before the analysis model is established, including feature extraction, variable reconstruction, noise filtering, interference processing, and correction of matrix effects and self-absorption. Among them, feature extraction from LIBS spectra and plasma images is the most essential preprocessing step; its purpose is to maximize the effective information and to prevent redundant or invalid information (such as noise and baseline in LIBS spectra) from degrading the accuracy, precision, and efficiency of analytical models.
4 Qualitative and quantitative modeling with machine learning
After the data preprocessing described above, the data are ready to be used as input variables for the analysis model. LIBS-based analysis models can be divided into qualitative and quantitative categories: qualitative analysis refers to the identification, clustering, or classification of samples, while quantitative analysis determines the elemental composition of samples. In this section, the research progress of machine learning algorithms for improving the qualitative and quantitative performance of LIBS is reviewed in detail. Fig.5 shows the role of machine learning in qualitative and quantitative analysis with LIBS.
4.1 Qualitative model
Owing to matrix effects, the physical properties and composition of the sample affect the elemental signal. LIBS displays significant matrix effects, which greatly hinder the application of the technology. However, matrix effects are not always detrimental to analytical results; they can be beneficial for sample classification. The most basic kind of qualitative analysis is the clustering or identification of data points according to their mutual similarities. Qualitative models are often visualized in a graph (dendrogram or scatter plot) and used to discriminate objects based on their characteristic spectra. The application of machine learning algorithms in qualitative LIBS analysis is manifested in three aspects: (i) establishing and optimizing qualitative analysis models; (ii) improving the accuracy of clustering or classification; and (iii) realizing automatic prediction.
Establishing and optimizing qualitative analysis models. To find suitable machine learning algorithms for clustering or classifying various types of samples, researchers have conducted extensive algorithm comparison studies. Vítková
et al. [
40] used a subset of PCA scores in conjunction with both ANN and LDA as variables for the classification of 18 different biominerals, and the ANN model showed better performance than the LDA model. The method can be used to create a database for simple and fast in situ identification of archeological or paleontological materials. Tang
et al. [
93] compared the predictive performance of four different machine learning methods (PLS-DA, SVM, RF and RF based on variable importance (VIRF)) for slag samples classification, and VIRF showed the highest classification accuracy. Yang
et al. [
94] studied the classification performance of six machine learning algorithms (PCA, DT, RF, PLS-DA, LDA, and SVM) for the geographic origins of 20 kinds of rice samples, and LDA was shown to be the most efficient tool for LIBS-assisted rice geographic origin classification, with high accuracy and analytical speed. LDA projects high-dimensional sample data onto an optimal discriminant vector subspace of lower dimension in order to compress the feature space and extract classification information. Alarsan
et al. [
95] identified heart diseases using three algorithms (DT, RF, and gradient-boosted trees), and RF showed the highest accuracy of 98.03%, because RF can balance errors in the data and evaluate the importance of each variable while avoiding over-fitting. Yu
et al. [
96] used several methods (PCA, PLS-DA, LDA, SVM) to classify nephrite samples from five different locations, and SVM showed the highest accuracy of 100% for predicting training data and 99.3% for predicting testing data. Gyftokostas
et al. [
97] demonstrated that LIBS coupled with machine learning (LDA, ERTC, RFC, and XGBoost) is a powerful tool for olive oil authenticity and geographic discrimination. Zhao
et al. [
98] traced the geographical origins of acacia honey and multi-floral honey using SVM and LDA, the accuracy of the SVM model was 99.7% which was superior to the LDA model. Luo
et al. [
99] applied three pattern recognition methods (discriminant analysis, RBF-ANN, and MLP) to identify rice species, and the MLP model showed the highest accuracy, 100% and 97.9% on the training and test sets, respectively. Huang
et al. [
100] adopted traditional machine learning methods (CNN, LDA, KNN, RF, and SVM) to identify 25 adulterated milk powders mixed with four different types of exogenous proteins, and SVM model obtained the highest accuracy of 93.9%. Kiss
et al. [
101] used a K-means algorithm to cluster various matrices within tumorous tissue. Typical skin tumors were selected for LIBS analysis, and imaging of biogenic elements (Mg, Ca, Na, and K) provided the elemental distribution within the tissue. The elemental images were correlated with tumor progression and margins, as well as with the difference between healthy and tumorous tissues. Ding
et al. [
102] applied LDA and SVM to identify three kinds of plant leaves (Ligustrum lucidum, Viburnum odoratissimum, and bamboo); the average classification accuracy of SVM on the test set reached 98.89%, better than that of LDA. Li
et al. [
103] applied PCA, KNN, and SVM to classify fat, skin and muscle tissues with an accuracy of over 99.83%, a sensitivity of over 0.995 and a specificity of over 0.998. Babu
et al. [
104] applied PCA and ANN to classify unaged, gamma-irradiated, and water-aged specimens; the ANN-based LIBS analysis was successful, with better classification accuracy than PCA. From the above studies, it can be seen that supervised algorithms such as SVM, LDA, ANN, and RF achieve higher classification accuracy than unsupervised algorithms. Singh
et al. [
105] compared the analytical merits (accuracy and precision) of applying the PCR and PLSR algorithms to identical LIBS data from a set of stainless steel samples, and proposed a few guidelines for selecting PCR or PLSR depending on the analytical situation. However, it is difficult to formulate a fixed, general rule for algorithm selection, because the characteristics of the spectra excited from different substances differ greatly, and the advantages and disadvantages of algorithms can only be compared with the same experimental equipment under the same experimental conditions.
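The following minimal Python sketch shows the typical form of such classifier comparisons, evaluating LDA, SVM, and RF under identical cross-validation conditions with scikit-learn; the synthetic data and resulting scores are illustrative only and have no bearing on the cited results.

# Minimal sketch: comparing classifiers under identical cross-validation.
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(4)
X = np.vstack([rng.normal(m, 1, (50, 300)) for m in (0.0, 0.4, 0.8)])
y = np.repeat([0, 1, 2], 50)

for name, clf in [("LDA", LinearDiscriminantAnalysis()),
                  ("SVM", SVC(kernel="rbf", C=10)),
                  ("RF", RandomForestClassifier(n_estimators=200, random_state=0))]:
    acc = cross_val_score(clf, X, y, cv=5).mean()
    print(f"{name}: mean accuracy = {acc:.3f}")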
Improving the accuracy of clustering or classification. The main purpose of applying machine learning algorithms to LIBS qualitative analysis is to improve the accuracy of clustering or classification. It is necessary to choose the best among different machine learning methods, or to improve existing algorithms, in order to establish an optimized qualitative analysis model. No spectral data library is currently available commercially, and even if one were, transfer between different LIBS systems is not yet possible. Therefore, each research group builds its own database according to sample type, application requirements, experimental conditions, etc. [
84]. For example, PCA is very suitable for data visualization, and SIMCA and PLS-DA can achieve automatic prediction. SIMCA has lower sensitivity than PLS-DA, but its model is more robust to unknown samples [
84]. Different machine learning algorithms have their advantages and disadvantages, and in some cases, combining different methods can overcome the defects of a single algorithm and extract potential features from LIBS data.
In addition, the introducing of different preprocessing methods could also improve the classification accuracy and efficiency. Liu
et al. [
106] investigated the classification and identification of rice geographical origin using LIBS combined with hyperspectral imaging (HSI) and machine learning methods. PCA was utilized for dimensionality reduction and feature extraction from the LIBS, HSI, and fused data, and PLS-DA, SVM, and ELM were used to achieve rapid and accurate rice quality and identity detection.
Realizing automatic prediction. The above studies also show that the combination of LIBS and machine learning algorithms has been successfully used for the automatic sorting of unknown samples. Several researchers have reported metal recycling applications of LIBS combined with machine learning algorithms. Campanella
et al. [
107] developed a strategy based on LIBS and ANN for the sorting of aluminum scrap samples. The neural network approach yields more reproducible results and can accommodate the unavoidable signal variations due to the low intrinsic reproducibility of LIBS systems. The results demonstrate the possibility of efficient (> 75%) classification of non-ferrous metallic automotive scrap using the LIBS and ANN method under conditions simulating an industrial environment. Park
et al. [
108] studied a 3D sensing system for LIBS-based metal scrap identification, and the PCA algorithm was used to reduce the wavelength data to principal components. A maximum classification accuracy of 95% was achieved when LIBS spectra were acquired from optimized rather than non-optimized sample surfaces. Demir
et al. [
109] used LIBS and PCA for the classification of 700 °C molten aluminum alloys without sample preparation.
To clearly show the performance of different machine learning algorithms in the LIBS field, Tab.2 summarizes sample classification using LIBS combined with machine learning algorithms, including the types of machine learning methods, improvement strategies, comparison methods, material types, optimal classification results, and references. As can be seen from Tab.2, PCA is widely used in combination with other machine learning algorithms for qualitative LIBS analysis because of its excellent correlation extraction and dimensionality reduction capabilities, which effectively eliminate the influence of noisy and redundant data on classification results and improve computing efficiency.
4.2 Quantitative model
Quantitative analysis is the most important goal of any analysis technique. However, due to the complexity of the interaction process between laser and material, the limitations of experimental instruments, matrix effects, self-absorption effects, and environmental conditions, the quantitative analysis ability has always been a bottleneck in the development of LIBS technology [
133]. To capture the complex relationship between spectra and analyte information, machine learning methods often have a higher model complexity than traditional univariate calibration methods based on physical principles. On the one hand, if too few samples are used for training, the quantitative model is prone to over-fitting; in practice, a data-driven model may require enough calibration samples to ensure that its applicability domain covers the validation samples. Of course, heavy use of standard samples increases the computational time and the cost of LIBS quantitative analysis. Furthermore, most machine learning models are not designed for interpretability, so it is difficult to judge whether their decision-making process conforms to the physical principles behind LIBS; this may reduce the robustness of machine learning-based LIBS qualification and quantification [251]. Research on machine learning algorithms in quantitative LIBS analysis focuses on the following three aspects: (i) improving the performance of the quantitative analytical model; (ii) eliminating matrix effects and self-absorption effects; and (iii) adapting to detection under complex conditions.
Improving the performance of the quantitative analytical model. The machine learning approaches used to construct quantitative LIBS models include single-algorithm models and multi-algorithm models. A single machine learning model is a prediction model based on only one algorithm, which is simple to implement and efficient. In most cases, the combination of preprocessing methods and a single machine learning algorithm can accurately determine the element content in the target sample, and the algorithm and its parameters should be optimized for the specific application [
134].
PLSR is one of the most widely used quantitative analysis method due to its excellent performance and simple calculation. Akhmetzhanov
et al. [
135] achieved quantitative detection of rare-earth elements (REE) in ores using PLSR, which solved the problems of significant overlap of REE lines in LIBS emission spectra and high pairwise correlation between REE contents in certified reference materials (CRMs). They confirmed that PLSR can compensate for the low resolution of handheld LIBS instruments and achieve quantitative analysis of Ce and La in REE-rich ores [
136]. Rao
et al. [
137] used PLSR, PCR, and ANN to detect trace elements in plutonium; PLSR was superior in determining iron and nickel contents in plutonium metal, with limits of detection (LoD) of 15 and 20 ppm, respectively. In another work, Rao's group created a boosted regression ensemble model (boosted regression tree, BRT) to predict the silicon content in silicon-doped ceria pellets [
138]. Its predictive accuracy was higher than that of traditional PCA, PLS, and ANN regression models. Gu
et al. [
139] applied conditional univariate quantitative analysis, MLR and PLSR to the quantitative analysis of steel alloys. PLSR showed low relative errors for two unknown steel alloy samples with values below 6.62% and 1.49%, respectively. Yaroshchyk
et al. [
140] applied PCR, PLSR, multi-block PLS, and serial PLS to the quantitative analysis of Fe content in iron ore. In comparison with PCR and PLS, the performance of the multi-block PLS algorithm is poor. Erler
et al. [
141] evaluated multiple regression methods including PLSR, least absolute shrinkage and selection operator regression (LASSO), and Gaussian process regression (GPR), for predicting Ca, K, Mg and Fe in soil. LASSO and GPR yielded slightly better results than PLSR. The advantages of GPR are mainly reflected in dealing with nonlinear and small data problems, and the model may fail when encountering high-dimensional spaces. Rao
et al. [
142] quantified gallium in cerium matrices via ensemble regressions, SVR, Gaussian kernel regressions, and ANN. Gaussian kernel regression is the best prediction model with RMSEP of 0.33% and an LoD of 0.015%. Yuan
et al. [
143] applied BP and MLR to study the content of forsterite and fayalite in olivine, and the root-mean-square error (RMSE) value of the BP model was the lowest (28.64). Shi
et al. [
144] applied SVR and PLSR to determine the concentrations of five main elements (Si, Ca, Mg, Fe and Al) in sedimentary rock samples, and found that the SVR model performed better with satisfactory accuracy. Ding
et al. [
145] applied KELM and PLSR to the quantitative analysis of the total iron content and alkalinity of sinter; the KELM model gave better predictions for both quantities, with correlation coefficients above 0.9, all higher than those of PLSR. Wu
et al. [
146] applied RFR and PLSR to quantitative analysis of S and P elements in steel samples, RF calibration model made good predictions of S (
R2=0.9974) and P (
R2=0.9981). Xiang
et al. [
147] employed MLR, PLSR, LS-SVM and BP-ANN to quantitatively analyze heavy metals Pb and Cd elements in soil, and LS-SVM and BP-ANN offered promising results. Ye
et al. [
148] used LIBS combined with PLSR and RFR algorithms to measure chemical oxygen demand (COD) in river water samples, the results showed that RFR had a high
R2 value (0.9248) and low RMSE value (25.1215 mg/L). Labutin
et al. [
149] constructed a PCR model under spectral interference using the C I 833.51 nm line for carbon determination in low-alloy steels in air. The predicted carbon content in a rail template agreed with the reference value obtained by a combustion analyzer within a relative error of 6%.
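As a concrete example of the PLSR calibrations discussed above, the following minimal Python sketch selects the number of latent variables by cross-validation and reports RMSEP and R2 on a held-out set; the synthetic spectra and settings are illustrative assumptions.

# Minimal sketch: PLSR calibration with cross-validated component selection.
import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.metrics import mean_squared_error, r2_score

rng = np.random.default_rng(5)
X = rng.random((120, 800))
y = 4 * X[:, 100] + 2 * X[:, 400] + rng.normal(0, 0.05, 120)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
search = GridSearchCV(PLSRegression(), {"n_components": list(range(1, 11))}, cv=5)
search.fit(X_train, y_train)

y_pred = search.predict(X_test).ravel()
print("best n_components:", search.best_params_["n_components"])
print("RMSEP:", np.sqrt(mean_squared_error(y_test, y_pred)))
print("R2:", r2_score(y_test, y_pred))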
Intelligent optimization algorithms have also been introduced into LIBS quantitative analysis. Sun
et al. [
150] implemented particle swarm optimization (PSO), GA, and ant colony optimization algorithms to quantitatively analyze the Pb concentration in water; the mean relative error and RSD of the test results obtained with the PSO algorithm were the best among these algorithms. Intelligent optimization algorithms also have shortcomings. Parameter selection seriously affects the quantitative results of GA, and at present the selection of these parameters relies mostly on experience. The PSO algorithm is simple and fast but prone to getting stuck in local optima, and the ant colony algorithm has a high computational cost.
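To illustrate how such intelligent optimization algorithms can be coupled to a LIBS calibration model, the following minimal Python sketch uses a simple particle swarm to tune the (C, gamma) hyperparameters of an SVR on a log scale; the swarm settings, search ranges, and synthetic data are illustrative assumptions.

# Minimal sketch: PSO tuning of SVR hyperparameters via cross-validation.
import numpy as np
from sklearn.svm import SVR
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(6)
X = rng.random((80, 300))
y = 3 * X[:, 20] + rng.normal(0, 0.05, 80)

def fitness(params):                      # params = [log10(C), log10(gamma)]
    svr = SVR(C=10 ** params[0], gamma=10 ** params[1])
    return -cross_val_score(svr, X, y, cv=3,
                            scoring="neg_root_mean_squared_error").mean()

low, high = np.array([-2.0, -4.0]), np.array([3.0, 0.0])
n_particles, n_iter = 10, 20
pos = rng.uniform(low, high, (n_particles, 2))
vel = np.zeros_like(pos)
pbest, pbest_val = pos.copy(), np.array([fitness(p) for p in pos])
gbest = pbest[pbest_val.argmin()]

for _ in range(n_iter):
    r1, r2 = rng.random((n_particles, 1)), rng.random((n_particles, 1))
    vel = 0.7 * vel + 1.5 * r1 * (pbest - pos) + 1.5 * r2 * (gbest - pos)
    pos = np.clip(pos + vel, low, high)
    vals = np.array([fitness(p) for p in pos])
    improved = vals < pbest_val
    pbest[improved], pbest_val[improved] = pos[improved], vals[improved]
    gbest = pbest[pbest_val.argmin()]

print("best log10(C), log10(gamma):", gbest, "CV RMSE:", pbest_val.min())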
Machine learning algorithms can be used not only for the quantitative analysis of elements in samples but also for the quantitative analysis of other chemical properties. Hao
et al. [
151] applied LIBS with PLSR to measure the acidity of iron ore; the average relative error (ARE) and RMSE of the acidity reached 3.65% and 0.0048, respectively. With conventional internal standard calibration, it is difficult to establish calibration models of iron ore acidity owing to serious matrix effects. PLSR effectively addresses this problem because of its ability to compensate for matrix effects: interfering and nonlinear signals were eliminated by choosing the number of PCs during model establishment. Wang
et al. [
152] applied VIRF, PLSR, and LS-SVM for the quantitative analysis of iron ore acidity, and the VIRF model showed excellent predictive performance. Similarly, Yang
et al. [
153] applied PLSR and RFR for measuring the basicity of sintered ore, which can be defined by the concentration of oxides: CaO, SiO
2, Al
2O
3 and MgO. The RFR model showed better predictive capabilities with an RSD of 0.27% to 0.59%. Lu
et al. [
154] applied PLSR and LS-SVR to quantitative analysis of pH value in soil, and LS-SVR effectively improved the analysis accuracy with the values of
R2, MAE, and RMSE of 0.987, 0.1 units (pH), and 0.079, respectively. Zhang
et al. [
155] applied PLSR, SVR, ANN, and PCR to the quantitative analysis of coal quality, and the ANN model had the lowest average relative error (ARE): 0.69% (ash content), 0.87% (volatile matter content), and 0.56 MJ·kg
−1 (calorific), respectively. Képeš
et al. [
156] explored the application of ANN for predicting plasma temperatures, and they leveraged synthetic data to isolate temperature effects from other factors and studied the relationship between the LIBS spectra and temperature learnt by the ANN. Saeidfirozeh
et al. [
157] also developed an ANN method for characterising crucial physical plasma parameters (i.e., temperature, electron density, and abundance ratios of ionisation states) in a fast and precise manner that mitigates common issues arising in the evaluation of laser-induced breakdown spectra. Thus, the introduction of machine learning algorithms has greatly facilitated the expansion of LIBS application fields, because such algorithms can automatically extract relevant information from spectra and establish robust quantification models.
Since various machine learning algorithms have their own strengths, some researchers have proposed multi-algorithm models to improve the performance of LIBS quantitative analysis. These methods combine two or more machine learning approaches to exploit their respective strengths and obtain the best analytical results. However, the combined models involve the complex process of selecting suitable preprocessing methods and algorithms, and they are computationally complex and time-consuming. For example, Ahmed
et al. [
158] adopted an ANN based on multi-line calibration (MLC-ANN) to improve the accuracy of LIBS quantitative analysis of aluminum alloys. Li
et al. [
159] proposed a multi-spectral line correction method based on ANN, which improves the accuracy and precision of LIBS analysis of steel compared to the traditional internal calibration method. Yang
et al. [
160] compared the prediction ability of the PLS, ANN, and PLS-ANN models in detecting nine essential element components in plant materials. The results show that the PLS-ANN model has the highest accuracy, the ANN model is the second, and the PLS model is the lowest. Li
et al. [
139] combined the standardization method and dominant factor based PLSR to improve the measurement accuracy of carbon content in coal with
R2, RMSEP, and ARE were 0.99, 1.63 wt.%, and 1.82%, respectively. Shabbir
et al. [
161] combined feature selection with BPNN for the analysis of raw rocks with RMSEPs of 1.6, 18, 101, and 162 ppm for Li, Rb, Sr, and Ba elements, respectively. Zhang
et al. [
162] established PLSR, SVR, and PLS-SVR models separately for the prediction of gelatin adulteration ratios; the results reveal that the PLS-SVR model can be employed as a preferred method for accurate estimation of edible gelatin adulteration. Huang
et al. [
163] introduced PCA and canonical correlation analysis (CCA) into SVR for the analysis of T91 steel specimens with different degrees of microstructure aging, and the maximum values of mean relative error (MRE), RSDs and RMSEPs were 2.47%, 2.94% and 6.14, respectively. Yu
et al. [
164] adopted a multivariate multispectral correction method combining DP-LIBS with BPNN to establish a GA-BP-ANN correction method, which effectively reduced the ARE of the predicted samples and further improved the accuracy of LIBS quantitative analysis. The above research confirms that combined models have advantages in improving the quantitative analysis ability of LIBS, but the complexity of the algorithms and the data processing time also increase accordingly. A concern, however, is that some inexperienced researchers lack fundamental knowledge of the capabilities and limitations of these complex algorithms and overlook the underlying physical principles. If physical principles are disregarded when using machine learning algorithms, the key variables may turn out to be unrelated to the properties of the element and instead related to pollutants and/or background features. To improve the accuracy and robustness of LIBS quantitative models without departing from physical mechanisms, a state-of-the-art strategy is to incorporate physical principles into machine learning algorithms, that is, to build a hybrid of machine learning and physical principles. Ideally, the intensity of the characteristic line of the element to be measured is linearly correlated with its concentration in the sample, and these characteristic lines play a dominant role in the model decisions compared with the lines of the matrix elements. Song
et al. [
165] proposed a schematic description of incorporating LIBS physical principles into machine learning, as shown in Fig.6. The method uses knowledge-based lines related to the analyte composition to build a linear, physical principle-based model and adopts KELM to account for the residuals of the linear model; the residual error is thus corrected by machine learning and chemometric models. Because knowledge-driven and data-driven models are combined for the final prediction, how the important spectral lines influence the result can be explained intuitively. The hybrid model inherits the advantage of physical principle-based methods, namely robustness over a wider range of sample matrices. Furthermore, its appropriate model complexity ensures that the complexity and nonlinearity of the data can be handled efficiently.
A typical example of incorporating physical principles into machine learning is PLS model based on dominant factor (DF-PLS) proposed by Wang
et al. [
166−
168]. The dominant factor is the major part of the concentration, extracted from the characteristic line intensity of the specific element based on physical principles: the linear relation between line intensities and elemental concentrations, nonlinear self-absorption effects, and inter-element interference. The physics-based dominant factor increases the robustness and sample adaptivity of the final multivariate model. Combined with the dominant factor, the PLS approach is further applied to minimize the residual errors by utilizing more spectral information to compensate for plasma fluctuations: MLR models the relationship between key emission lines and the analyte concentration, and the residual error of MLR is corrected by performing PLS on the full spectrum. The model thus combines the advantages of both the univariate and PLS models. Li
et al. [
169] combined atomic and molecular emission spectra in the dominant factor to improve the quantification of coal. The method shows better performance than PLS in LIBS quantification tasks such as coal property analysis [
167,
170] and content determination of brass alloy [
171], and it has recently been combined with plasma images to correct for self-absorption [
172]. Based on the DF-PLS, a hybrid model was developed to identify known calibration samples from a self-adaptive spectral database [
173]. First, new spectra are standardized to reduce signal uncertainty, and their similarity to the stored spectra is evaluated. Quantitative information for samples inside and outside the database can then be determined directly from the database or from DF-PLS analysis, respectively. As the database is updated, the hybrid model improves measurement reproducibility and reduces the measurement-to-measurement RSD. Further modifications of DF-PLS mainly include nonlinear extraction of the dominant factor and residual correction using nonlinear models; the former extracts the dominant factor through a nonlinear transformation of line intensities [
174], while the latter uses machine learning methods such as SVR and KELM to increase the accuracy of residual correction [
175,
176]. The accuracy of dominant factor-based methods exceeds that of their non-dominant factor-based counterparts in coal property analysis tasks in most cases [
176]. In addition, the linear (physical principle-based) and nonlinear (data-driven) parts of the hybrid model can be jointly optimized to improve the performance of LIBS quantitative analysis. Although machine learning is usually used as a black box, some studies suggest that it essentially follows certain physical mechanisms, which need to be revealed and understood. Képeš
et al. [
177] applied various post-hoc interpretation techniques with the aim of interpreting the decision-making of a CNN. They found synthetic spectra that yield the expected classification predictions perfectly and concluded that the CNN can only learn meaningful spectroscopic features.
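To make the dominant-factor idea concrete, the following sketch builds a dominant factor from a few characteristic line intensities (with a simple quadratic term loosely standing in for non-linear self-absorption behaviour) and then regresses its residuals on the full spectrum with PLS. The class name, line indices, and component counts are illustrative assumptions, not a reproduction of the published DF-PLS implementation.

```python
# Minimal DF-PLS-style sketch: dominant factor from key analyte lines, residual
# correction by PLS on the full spectrum. Parameter choices are placeholders.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.cross_decomposition import PLSRegression

class DFPLSSketch:
    def __init__(self, line_idx, n_components=5):
        self.line_idx = list(line_idx)
        # quadratic terms loosely mimic a non-linear (self-absorption-like) response
        self.dominant = make_pipeline(
            PolynomialFeatures(degree=2, include_bias=False), LinearRegression())
        self.residual_pls = PLSRegression(n_components=n_components)

    def fit(self, X, y):
        self.dominant.fit(X[:, self.line_idx], y)
        resid = y - self.dominant.predict(X[:, self.line_idx])
        self.residual_pls.fit(X, resid)   # full spectrum compensates plasma fluctuations
        return self

    def predict(self, X):
        return (self.dominant.predict(X[:, self.line_idx])
                + self.residual_pls.predict(X).ravel())
```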
Eliminating matrix effects and self-absorption effects. If the test sample exhibits severe matrix differences or the spectral lines of the analytical elements suffer from self-absorption [178], it is difficult to obtain reliable quantitative results from conventional univariate calibration models based on physical mechanisms. Some machine learning algorithms can automatically extract relevant information from LIBS spectra to construct multivariate quantitative models, so they can be used to mitigate the matrix effects and self-absorption encountered in LIBS quantitative analysis. PLSR and PCR are essentially multiple linear regression methods, which can take chemical and physical matrix effects into account by including peak information of matrix elements in the model while eliminating redundancy and the non-linear response to analyte concentrations. For example, the PLSR method has been used for quantitative analysis of the concentrations of CaO, MgO, Al2O3, and SiO2 in hematite and limonite ore samples [151]. The results showed that the PLSR models can compensate for the matrix effects and yield accurate quantitative results. Amador-Hernandez
et al. [
179] applied PLSR and LIBS to quantify Au and Ag in precious metals; in their PLSR models, spectral ranges containing fewer strong resonance lines were preferred, since less self-absorption occurs there. Death
et al. [
180] investigated the quantitative analysis of iron ore samples using PCR, and the results confirmed that PCR can effectively reduce the effect of self-absorption on the quantification. Zaytsev
et al. [
181] investigated the effectiveness of PCR in addressing matrix effects and spectral interference in quantitative analysis of LIBS. PCR provided good predictive capability in the spectral ranges where numerous matrix lines strongly interfered with analytical lines [
182]. PCR is a linear regression model, so the non-linear response of some portions of the LIBS spectra due to self-absorption may be partitioned into principal components that attract lower regression scores and thus contribute less to the calibration outcome than the PCs containing non-self-absorbed spectral data. Huang
et al. [
183] reviewed the progress of LIBS combined with machine learning methods for reducing matrix effects and self-absorption in soil analysis. Rethfeldt
et al. [
184] used univariate and multivariate regression methods (interval PLS) to detect rare earth elements (REE) in minerals and soils by LIBS; the iPLS method is better suited for determining REE contents in heterogeneous field samples. In iPLS regression, only parts of the relevant element lines are included, so self-absorbed regions and partially contaminated line flanks are excluded, resulting in improved regressions with higher coefficients of determination. Bhatt
et al. [
185] reviewed the performance of univariate and multivariate analysis methods in the quantitative analysis of REE by LIBS. The review indicates that PLSR is one of the crucial multivariate techniques for reducing the matrix effect. Kwapis
et al. [
186] reviewed the development of machine learning and LIBS measurements for nuclear applications. Multivariate techniques (PLSR and PCR) are used to mitigate the detrimental influence of matrix effects on predictions by including information from multiple emission lines up to the entire visible spectrum. PLS is closely related to PCR, which is used to eliminate collinearity from LIBS spectra while simultaneously addressing overfitting through dimensionality reduction of the data set. Multiple separate PLS models have been developed to perform in situ online monitoring of elemental concentrations in molten salts. The above machine learning algorithms have proven effective in correcting the chemical matrix effect, but physical matrix effects (surface roughness, hardness, and heterogeneity) pose greater challenges to the model. Sun
et al. [
187] developed a transfer learning model training algorithm and demonstrated its effectiveness in overcoming the physical matrix effect caused by changes in the physical state of samples in LIBS analyses. This method is intended for the LIBS analysis of rocks in Mars exploration. They also found that samples with the same chemical composition but different physical forms are preferable for efficient training of a transfer learning model.
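As a minimal sketch of the full-spectrum PLSR calibrations discussed above, the snippet below selects the number of latent variables by cross-validation and fits a multi-target model; `spectra`, `oxides`, and all parameter values are hypothetical placeholders rather than settings from any cited study.

```python
# Minimal PLSR calibration sketch, assuming `spectra` (n_samples x n_channels)
# and `oxides` (reference concentrations, e.g. CaO, MgO, Al2O3, SiO2 columns).
import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import cross_val_score

def pick_components(spectra, oxides, max_lv=15):
    """Choose the number of latent variables by 5-fold cross-validated RMSE."""
    scores = []
    for lv in range(1, max_lv + 1):
        pls = PLSRegression(n_components=lv)
        # negative RMSE averaged over folds; values closer to zero are better
        s = cross_val_score(pls, spectra, oxides, cv=5,
                            scoring="neg_root_mean_squared_error").mean()
        scores.append(s)
    return int(np.argmax(scores)) + 1

# best_lv = pick_components(spectra, oxides)
# model = PLSRegression(n_components=best_lv).fit(spectra, oxides)
# predicted = model.predict(unknown_spectra)
```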
It should be noted that PLSR and PCR struggle to establish a robust quantitative model when sufficient linear correlation information cannot be obtained from spectra with extremely strong variations in matrix or self-absorption effects. Therefore, nonlinear machine learning techniques (ANN, CNN, BPNN, SVR, etc.) are now widely used to address these problems. These models do not account for causes and effects; they automatically capture the correlation between the spectral intensity inputs and the elemental concentrations in the samples. An ANN is a non-linear machine learning technique that offers an advantage for modelling complex matrix effects and self-absorption by including non-linearity with a high degree of flexibility. It was reported that ANNs showed the potential to account for the effects of chemical and physical matrices and overlapped lines when the major elemental compositions of rock samples were measured [
188]. As for self-absorption, ANNs can in principle account for these effects by modelling the non-linear relationship using a flexible statistical model. Sirven
et al. [
189] confirmed that ANNs have advantages over conventional calibration curves and PLS, especially in accounting for the non-linearity between spectral intensities and concentrations caused by self-absorption in the plasma. The high variation of the raw LIBS signal seriously reduces the accuracy and stability of spectral analysis. To solve this problem, Xu
et al. [
190] applied a CNN to predict soil types and soil properties from non-preprocessed LIBS spectra. The results confirmed that the CNN models performed better in preventing overfitting than conventional PLS combined with various spectral preprocessing approaches. Yang
et al. [
191] also proposed a robust least squares support vector machine (RLS-SVM) regression model to address data fluctuation across repeated LIBS measurements. Through an improved segmented weighting function, spectral data that fall within the normal distribution are retained in the regression model, while outliers are down-weighted or removed.
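The segmented weighting idea can be sketched as an iteratively re-weighted kernel regression: residuals consistent with a normal distribution keep full weight, moderate outliers are linearly down-weighted, and strong outliers are discarded. KernelRidge stands in for the LS-SVM solver here, and the thresholds are illustrative assumptions rather than the published settings.

```python
# Robust, weighted least-squares regression sketch in the spirit of RLS-SVM.
import numpy as np
from sklearn.kernel_ridge import KernelRidge

def segmented_weights(residuals, c1=2.5, c2=3.0):
    """Full weight for small residuals, linear down-weighting, then removal."""
    s = 1.4826 * np.median(np.abs(residuals - np.median(residuals)))  # robust scale (MAD)
    r = np.abs(residuals) / max(s, 1e-12)
    w = np.ones_like(r)
    mid = (r > c1) & (r <= c2)
    w[mid] = (c2 - r[mid]) / (c2 - c1)   # moderate outliers: reduced weight
    w[r > c2] = 0.0                      # strong outliers: removed
    return w

def fit_robust(X, y, n_iter=3):
    model = KernelRidge(kernel="rbf", alpha=1.0, gamma=1e-3)
    w = np.ones(len(y))
    for _ in range(n_iter):              # iteratively re-fit with updated weights
        model.fit(X, y, sample_weight=w)
        w = segmented_weights(y - model.predict(X))
    return model
```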
Adapting to detection under complex conditions. LIBS offers in-situ, fast, and remote monitoring, which allows it to operate in complex environments such as high temperatures, space, the deep sea, and radioactive, toxic, or explosive settings. However, it is difficult for LIBS to obtain stable spectral signals in these complex environments, which renders conventional calibration models ineffective; machine learning algorithms, with their good adaptability, can extract relevant information from complex spectra to establish robust calibration models. Yang
et al. [
192] proposed a LIBS-based method for measuring transient surface temperatures, which is of great significance for fast sliding friction processes in linear electromagnetic propulsion, gun barrels, and high-speed trains. Three algorithms, single-peak fitting (SPF), PLS, and BP-ANN, were used to predict the surface temperature, and the results showed that BP-ANN performed best at exposure times of 1, 2, and 3 μs. Since LIBS allows remote measurement, the development of LIBS sensors/systems for determining the elemental composition of molten phases has been a hot research topic. Sun
et al. [
193] developed a system comprising a Cassegrain telescope and a double-pulse LIBS setup for the quantitative analysis of Si, Mn, Cr, Ni, and V in molten steel samples; the PLSR calibration method offered better repeatability and accuracy than univariate calibration. In addition, the system was used for the estimation of C, Si, and Mn in molten steel samples in an industrial oven [
194]. Lee
et al. [
195] investigated LIBS as a possible option for remote online monitoring of difficult-to-access, molten-salt-based nuclear reactors. The height of the molten salt fluctuates easily under vibration; in this study, machine learning (PLS and ANN) models trained with both focused and defocused data were constructed, and the best RMSEP values of 0.0210–0.0316 wt% were obtained for Sr and Mo using ANN models. This is because the training and test data sets accounted for defocusing, which significantly affected the nonlinear pattern. In addition, defocused measurements introduce self-absorption, which can produce a saturated or even reversed calibration curve due to thick plasma formation. The results suggested that a nonlinear model is more suitable for predicting the composition of molten salt fluctuating under vibration. ChemCam is one of the sensor systems on the Mars Science Laboratory rover Curiosity, which landed on Mars in August 2012. Gasda
et al. [
196] reported a calibration model for manganese using the LIBS instrument that is part of the ChemCam instrument suite onboard the NASA Curiosity rover. The optimal calibration model uses the PLS and least absolute shrinkage and selection operator (LASSO) multivariate techniques. The double-blended multivariate model shows an RMSEP of 1.39 wt% MnO. China’s first Mars exploration mission, Tianwen-1, landed on Mars on 15 May 2021. Yang
et al. [
197] investigated the performance of a purpose-designed deep CNN on datasets consisting of multi-distance spectra acquired at eight different distances ranging from 2.0 to 5.0 m. More than 18 000 LIBS spectra were collected by a duplicate model of the Mars Surface Composition Detector (MarSCoDe) instrument for China’s Tianwen-1 Mars mission.
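A minimal sketch of the training strategy used for such fluctuating measurement conditions is shown below: spectra acquired at the nominal focus are pooled with deliberately defocused spectra so that a nonlinear model learns to tolerate focal-distance (height) variations. The array names and network hyperparameters are hypothetical placeholders.

```python
# Sketch: train a nonlinear model on pooled focused + defocused spectra so that
# predictions remain stable when the sample-to-lens distance fluctuates.
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# X_focus, y_focus: spectra and concentrations at the nominal focus (placeholders)
# X_defocus, y_defocus: spectra and concentrations at shifted focal positions
# X = np.vstack([X_focus, X_defocus]); y = np.concatenate([y_focus, y_defocus])

ann = make_pipeline(
    StandardScaler(),
    MLPRegressor(hidden_layer_sizes=(64, 32), activation="relu",
                 max_iter=2000, random_state=0),
)
# ann.fit(X, y)
# rmsep = np.sqrt(np.mean((ann.predict(X_test) - y_test) ** 2))
```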
Underwater LIBS of submerged solids has long suffered from serious spectral deformation and shot-to-shot fluctuation. Multivariate analyses, such as PCR and PLSR models, have been applied to improve the quantitative performance of underwater LIBS. Takahashi
et al. [
198,
199] achieved PCR- and PLS-based quantification of underwater LIBS data from submerged alloy samples. The non-linear effects of excitation temperature fluctuations on the signals are treated as systematic errors in the analysis, and their effect on the analytical performance is evaluated by applying PCR and PLS with a temperature-segmented database. The results demonstrated that the proposed database segmentation can improve the quantitative accuracy of the PCR and PLS models. Zheng’s group [
200] developed an underwater LIBS system named LIBSea II, which was deployed on the remotely operated vehicle (ROV) Haima for a deep-sea trial. To reduce the matrix effect and the instability of LIBS signals, MLR calibration models for Zn and Cu were constructed. The correlation coefficients (R2) between the predicted and reference concentrations are 0.989 and 0.979 for Zn and Cu, respectively. These results indicate that LIBSea II is capable of in-situ direct detection and quantitative analysis of submerged solids in a real seawater environment.
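The temperature-segmented database idea mentioned above can be sketched as follows: each spectrum is assigned to a bin according to an estimated plasma temperature (for example, from a line-intensity ratio), and a separate PLS model is trained and applied per bin. Function names, bin edges, and component counts are illustrative assumptions.

```python
# Sketch of temperature-segmented calibration: one PLS model per temperature bin.
import numpy as np
from sklearn.cross_decomposition import PLSRegression

def fit_segmented(X, y, temperature, edges):
    """Train a separate PLS model for each temperature segment with enough data."""
    bins = np.digitize(temperature, edges)
    models = {}
    for b in np.unique(bins):
        mask = bins == b
        if mask.sum() >= 10:                       # require enough spectra per segment
            models[b] = PLSRegression(n_components=5).fit(X[mask], y[mask])
    return models

def predict_segmented(models, X, temperature, edges):
    bins = np.digitize(temperature, edges)
    out = np.full(len(X), np.nan)                  # NaN where no segment model exists
    for b, m in models.items():
        mask = bins == b
        if mask.any():
            out[mask] = m.predict(X[mask]).ravel()
    return out
```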
Real-time in-line quantitative analysis instruments are in high demand in many industrial sectors. LIBS is uniquely positioned in this regard, but the complexity of the field environment can seriously degrade analytical performance; fortunately, machine learning algorithms have the potential to compensate for this shortcoming. For example, Li
et al. [
201] designed a LIBS setup with an optimized optical route and a PCA-PLS algorithm for real-time, high-precision online determination of total iron (TFe), silica (SiO2), aluminum oxide (Al2O3), and phosphorus (P) in iron ore. In this work, the spectral pretreatment algorithm was optimized for baseline removal and spectral normalization. The overlapped window slide algorithm avoids the deformation of emission peaks during baseline removal, and two normalization steps, by total background area and by total spectral intensity within each sub-channel, are applied to improve the stability of the spectral data.
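A pretreatment-plus-regression chain of this kind can be sketched as a simple pipeline: area normalization of each (already baseline-corrected) spectrum, PCA compression, and PLS regression. The pipeline below is a rough illustration under those assumptions; component counts and variable names are placeholders, not the parameters of the cited system.

```python
# Sketch of a PCA-PLS chain: normalize each spectrum, compress with PCA, regress with PLS.
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer
from sklearn.decomposition import PCA
from sklearn.cross_decomposition import PLSRegression

def total_intensity_normalize(X):
    # assumes the baseline has already been subtracted upstream
    return X / X.sum(axis=1, keepdims=True)

pca_pls = Pipeline([
    ("normalize", FunctionTransformer(total_intensity_normalize)),
    ("pca", PCA(n_components=20)),
    ("pls", PLSRegression(n_components=8)),
])
# pca_pls.fit(train_spectra, train_grades)     # e.g., TFe, SiO2, Al2O3, P columns
# predictions = pca_pls.predict(new_spectra)
```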
In summary, due to the matrix effect and the fluctuation of the LIBS signal, conventional univariate models for LIBS quantitative analysis present poorer accuracy and precision than other comparable spectral analysis techniques. Introducing machine learning algorithms can solve these problems to a certain extent. However, owing to changes in the experimental environment and matrix differences between unknown and training samples, analytical models based on machine learning exhibit low generalization ability. Thus, the combination of physical mechanisms and machine learning models has attracted the interest and attention of researchers in the LIBS field. To comprehensively show the quantitative performance of various machine learning algorithms in the LIBS field, Tab.3 presents a summary of quantitative analysis using LIBS combined with machine learning algorithms, including the types of machine learning methods, improvement strategies, methods used for comparison, analyzed elements, quantitative results, and references. The aim of this list is to: (i) show which machine learning algorithms have been investigated for LIBS quantitative analysis; and (ii) compare the ability of different algorithms to improve quantitative performance, providing a reference for future related research.
5 Summary and prospects
The research progress of machine learning in the LIBS field in recent decades was presented in this review, including data selection, variable selection, noise filtering, interference processing, and qualitative and quantitative analysis based on machine learning methods. For a successful analytical model, the selected input variables should contain or reflect the peak information of the original LIBS data while minimizing random noise, background interference, self-absorption, and matrix effects to the greatest extent possible. Moreover, the machine learning algorithm should discriminate and reasonably utilize the input variables to which analytical uncertainty, accuracy, precision, and generalization ability are sensitive. To solve the application problems of machine learning algorithms in LIBS and to simplify the construction of analytical models, a method chain including data preprocessing, physical principles, and intelligent modeling based on machine learning algorithms is a promising way forward. It may replace traditional modeling methods and accelerate both the expansion of application fields and the improvement of the analytical capabilities of LIBS technology. The application of machine learning in LIBS has the following characteristics:
1) Universality. Machine learning is versatile across the whole process of LIBS analysis, including data selection, feature selection and extraction, noise filtering, self-absorption correction, matrix effect suppression, sample identification, and quantitative analysis. It has played an important role in many application fields, including geological exploration, industrial metallurgy, environmental pollution monitoring, food safety, and biomedicine. If there are enough data, selecting appropriate machine learning algorithms can greatly improve the spectral stability and the accuracy of qualitative/quantitative analysis of the LIBS technique.
2) Specificity. All kinds of machine learning methods, such as unsupervised learning, supervised learning, and semi-supervised learning, can be adapted to LIBS analysis. Although machine learning is effective for LIBS data processing and modeling, the same algorithm may have different analytical capabilities for samples with different matrix features. There is no universal machine learning algorithm that solves all LIBS problems; the selection of machine learning algorithms depends on the specific application requirements. Therefore, selecting and optimizing machine learning algorithms is very important for LIBS analysis to achieve optimal analytical performance. When a single algorithm cannot meet these requirements, multiple algorithms can be combined, or a hybrid algorithm based on physical principles and machine learning can be established. When none of the above approaches meets the requirements, an improved or new machine learning algorithm should be developed.
3) Selectivity. Although early machine learning was only used to solve the problem of simultaneous multivariate analysis, the use of the LIBS technique combined with machine learning methods has gradually increased for classification and regression analysis. An increasing number of studies indicate that combining different machine learning algorithms at different stages of LIBS data analysis is a new trend: some methods are very effective for preprocessing, while others are more beneficial for modeling, which includes classification and quantification. Therefore, selecting suitable methods for the different stages is also very important and worth further research.
4) Limitation. Although machine learning is excellent at assisting LIBS analysis, it is not omnipotent. An insufficient number of samples and a lack of physical principles make models based on machine learning methods less robust, making it difficult to obtain accurate quantitative results for unknown samples outside the training set. Therefore, when using machine learning algorithms, their limitations should be fully recognized. If machine learning algorithms are applied to LIBS analysis indiscriminately, without regard for physical mechanisms, the predicted material types or element contents may be related not to the characteristic spectral information of the constituent elements but to noise, background, interference, or even contaminants.
5) Prospects. At present, although machine learning has been proven to improve the analytical performance of LIBS, the improvement depends on the specific application. The reason is that the current training data are not sufficient for accurate prediction when the sample to be tested is unknown. The potential for overtraining is significant with LIBS spectral data, resulting in calibration or classification models that are less robust than even univariate models. Therefore, future large models might provide a promising way to break through the current limitations; such models would have better adaptability to uncertain factors such as the matrix effect, self-absorption, noise interference, and instrument parameter drift. However, robust analytical models require a large number of training samples, and how to process massive spectral data is also an important problem to be studied in the application of machine learning algorithms to LIBS.