Neural partially linear additive model

Liangxuan ZHU, Han LI, Xuelin ZHANG, Lingjuan WU, Hong CHEN

Front. Comput. Sci., 2024, Vol. 18, Issue 6: 186334. DOI: 10.1007/s11704-023-2662-3
Artificial Intelligence
RESEARCH ARTICLE


Abstract

Interpretability has drawn increasing attention in machine learning. Most works focus on post-hoc explanations rather than building self-explaining models. To address this, we propose a Neural Partially Linear Additive Model (NPLAM), which automatically distinguishes insignificant, linear, and nonlinear features within a neural network. On the one hand, the neural-network construction fits the data better than spline functions with the same number of parameters; on the other hand, the learnable gate design and sparsity regularization terms preserve the ability to perform feature selection and structure discovery. We theoretically establish generalization error bounds for the proposed method via Rademacher complexity. Experiments on both simulated and real-world datasets verify its good performance and interpretability.
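To make the gate mechanism concrete, the snippet below is a minimal sketch (ours, not the authors' implementation) of one gated per-feature block: a feature-selection gate g1 scales the feature's whole contribution, a structure-discovery gate g2 mixes a linear path with a small nonlinear sub-network, and l1 penalties on both gates encourage sparsity. The exact parameterization in the paper may differ.

```python
import tensorflow as tf

class GatedFeatureBlock(tf.keras.layers.Layer):
    """Illustrative gated block for a single input feature (a sketch, not the paper's code)."""

    def __init__(self, hidden_units=50, l1_gate=1e-2):
        super().__init__()
        self.linear = tf.keras.layers.Dense(1, use_bias=False)      # linear path
        self.nonlinear = tf.keras.Sequential([                      # nonlinear path
            tf.keras.layers.Dense(hidden_units, activation="relu"),
            tf.keras.layers.Dense(1),
        ])
        self.g1 = self.add_weight(name="g1", shape=(), initializer="ones")  # feature-selection gate
        self.g2 = self.add_weight(name="g2", shape=(), initializer="ones")  # structure-discovery gate
        self.l1_gate = l1_gate

    def call(self, x_j):
        # x_j has shape (batch, 1): a single input feature.
        out = self.g1 * ((1.0 - self.g2) * self.linear(x_j) + self.g2 * self.nonlinear(x_j))
        # Sparsity-inducing l1 penalties on both gates.
        self.add_loss(self.l1_gate * (tf.abs(self.g1) + tf.abs(self.g2)))
        return out

block = GatedFeatureBlock()
print(block(tf.random.uniform((8, 1), -2.5, 2.5)).shape, block.losses)
```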


Keywords

feature selection / structure discovery / partially linear additive model / neural network

Cite this article

Liangxuan ZHU, Han LI, Xuelin ZHANG, Lingjuan WU, Hong CHEN. Neural partially linear additive model. Front. Comput. Sci., 2024, 18(6): 186334. https://doi.org/10.1007/s11704-023-2662-3

Liangxuan Zhu received his BS degree from the North China Institute of Science and Technology, China in 2019. He is currently pursuing the MS degree at the College of Informatics, Huazhong Agricultural University, China. His current research interests lie in machine learning, including deep learning, learning theory and interpretability

Han Li received her BS degree in Mathematics and Applied Mathematics from the Faculty of Mathematics and Computer Science, Hubei University, China in 2007. She received her PhD degree from the School of Mathematics and Statistics at Beihang University, China. She worked as a project assistant professor in the Department of Mechanical Engineering, Kyushu University, Japan. She now works as an associate professor in the College of Informatics, Huazhong Agricultural University, China. Her research interests include neural networks, learning theory and pattern recognition

Xuelin Zhang received his BE degree from the China Agricultural University, China in 2019. He is currently a PhD student with the College of Science, Huazhong Agricultural University, China. His current research interests include robust machine learning and statistical learning theory

Lingjuan Wu received her PhD degree in Microelectronics and Solid State Electronics from Peking University, China in 2013. She visited the University of California, San Diego, USA as a research scholar from 2010 to 2012. She is currently an associate professor with the College of Informatics, Huazhong Agricultural University, China. Her research interests are in machine learning and hardware security, including learning theory, machine learning-based side-channel analysis, and hardware Trojan detection

Hong Chen received his BS and PhD degrees from Hubei University, China in 2003 and 2009, respectively. He worked as a postdoctoral researcher at the University of Texas, USA during 2016–2017. He is currently a professor with the College of Informatics, Huazhong Agricultural University, China. His current research interests include machine learning, statistical learning theory and approximation theory

References

[1]
Rudin C, Chen C, Chen Z, Huang H, Semenova L, Zhong C. Interpretable machine learning: fundamental principles and 10 grand challenges. Statistics Surveys, 2022, 16: 1–85
[2]
Du M, Liu N, Hu X. Techniques for interpretable machine learning. Communications of the ACM, 2019, 63(1): 68–77
[3]
Rudin C. Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead. Nature Machine Intelligence, 2019, 1(5): 206–215
[4]
Härdle W, Liang H, Gao J T. Partially Linear Models. Heidelberg: Physica, 2000
[5]
Xie Q, Liu J. Combined nonlinear effects of economic growth and urbanization on CO2 emissions in China: evidence from a panel data partially linear additive model. Energy, 2019, 186: 115868
[6]
Shim J H, Lee Y K. Generalized partially linear additive models for credit scoring. The Korean Journal of Applied Statistics, 2011, 24(4): 587–595
[7]
Kazemi M, Shahsavani D, Arashi M. Variable selection and structure identification for ultrahigh-dimensional partially linear additive models with application to cardiomyopathy microarray data. Statistics, Optimization & Information Computing, 2018, 6(3): 373–382
[8]
Zhang H H, Cheng G, Liu Y. Linear or nonlinear? Automatic structure discovery for partially linear models. Journal of the American Statistical Association, 2011, 106(495): 1099–1112
[9]
Du P, Cheng G, Liang H. Semiparametric regression models with additive nonparametric components and high dimensional parametric components. Computational Statistics & Data Analysis, 2012, 56(6): 2006–2017
[10]
Huang J, Wei F, Ma S. Semiparametric regression pursuit. Statistica Sinica, 2012, 22(4): 1403–1426
[11]
Lou Y, Bien J, Caruana R, Gehrke J. Sparse partially linear additive models. Journal of Computational and Graphical Statistics, 2016, 25(4): 1126–1140
[12]
Petersen A, Witten D. Data-adaptive additive modeling. Statistics in Medicine, 2019, 38(4): 583–600
[13]
Sadhanala V, Tibshirani R J. Additive models with trend filtering. The Annals of Statistics, 2019, 47(6): 3032–3068
[14]
Agarwal R, Melnick L, Frosst N, Zhang X, Lengerich B, Caruana R, Hinton G E. Neural additive models: interpretable machine learning with neural nets. In: Proceedings of the 35th International Conference on Neural Information Processing Systems. 2021, 4699–4711
[15]
Nelder J A, Wedderburn R W M. Generalized linear models. Journal of the Royal Statistical Society. Series A (General), 1972, 135(3): 370–384
[16]
Hastie T, Tibshirani R. Generalized additive models. Statistical Science, 1986, 1(3): 297–310
[17]
Tibshirani R. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society. Series B (Methodological), 1996, 58(1): 267–288
[18]
Ravikumar P, Lafferty J, Liu H, Wasserman L. Sparse additive models. Journal of the Royal Statistical Society. Series B (Statistical Methodology), 2009, 71(5): 1009–1030
[19]
Xu S Y, Bu Z Q, Chaudhari P, Barnett I J. Sparse neural additive model: interpretable deep learning with feature selection via group sparsity. In: Proceedings of the ICLR 2022 PAIR2Struct Workshop. 2022
[20]
Feng J, Simon N. Sparse-input neural networks for high-dimensional nonparametric regression and classification. 2017, arXiv preprint arXiv: 1711.07592v1
[21]
Lemhadri I, Ruan F, Abraham L, Tibshirani R. LassoNet: a neural network with feature sparsity. The Journal of Machine Learning Research, 2021, 22(1): 127
[22]
Wang X, Chen H, Yan J, Nho K, Risacher S L, Saykin A J, Shen L, Huang H, ADNI. Quantitative trait loci identification for brain endophenotypes via new additive model with random networks. Bioinformatics, 2018, 34(17): i866–i874
[23]
Nair V, Hinton G E. Rectified linear units improve restricted Boltzmann machines. In: Proceedings of the 27th International Conference on Machine Learning. 2010, 807–814
[24]
Huber P J. Robust estimation of a location parameter. In: Kotz S, Johnson N L, eds. Breakthroughs in Statistics: Methodology and Distribution. New York: Springer, 1992, 492–518
[25]
Lu Y Y, Fan Y, Lv J, Noble W S. DeepPINK: reproducible feature selection in deep neural networks. In: Proceedings of the 32nd International Conference on Neural Information Processing Systems. 2018, 8690–8700
[26]
Kingma D P, Ba J. Adam: a method for stochastic optimization. In: Proceedings of the 3rd International Conference on Learning Representations. 2015
[27]
Golowich N, Rakhlin A, Shamir O. Size-independent sample complexity of neural networks. In: Proceedings of the Conference on Learning Theory. 2018, 297–299
[28]
McDiarmid C. On the method of bounded differences. In: Siemons J, ed. Surveys in Combinatorics. Cambridge: Cambridge University Press, 1989, 148–188
[29]
Chen H, Wang Y, Zheng F, Deng C, Huang H. Sparse modal additive model. IEEE Transactions on Neural Networks and Learning Systems, 2021, 32(6): 2373–2387
[30]
Wang X, Chen H, Cai W, Shen D, Huang H. Regularized modal regression with applications in cognitive impairment prediction. In: Proceedings of the 31st International Conference on Neural Information Processing Systems. 2017, 1447–1457
[31]
Cucker F, Zhou D X. Learning Theory: An Approximation Theory Viewpoint. Cambridge: Cambridge University Press, 2007
[32]
Wu Q, Ying Y, Zhou D X. Learning rates of least-square regularized regression. Foundations of Computational Mathematics, 2006, 6(2): 171–192
[33]
Krogh A. What are artificial neural networks? Nature Biotechnology, 2008, 26(2): 195–197
[34]
Ng A Y. Feature selection, L1 vs. L2 regularization, and rotational invariance. In: Proceedings of the 21st International Conference on Machine Learning. 2004, 78
[35]
Yang L, Lv S, Wang J. Model-free variable selection in reproducing kernel Hilbert space. The Journal of Machine Learning Research, 2016, 17(1): 2885–2908
[36]
Aygun R C, Yavuz A G. Network anomaly detection with stochastically improved autoencoder based models. In: Proceedings of the 4th IEEE International Conference on Cyber Security and Cloud Computing. 2017, 193–198
[37]
Chicco D, Warrens M J, Jurman G. The coefficient of determination R-squared is more informative than SMAPE, MAE, MAPE, MSE and RMSE in regression analysis evaluation. PeerJ Computer Science, 2021, 7: e623
[38]
Lin Y, Tu Y, Dou Z. An improved neural network pruning technology for automatic modulation classification in edge devices. IEEE Transactions on Vehicular Technology, 2020, 69(5): 5703–5706
[39]
Pace R K, Barry R. Sparse spatial autoregressions. Statistics & Probability Letters, 1997, 33(3): 291–297
[40]
Hamidieh K. A data-driven statistical model for predicting the critical temperature of a superconductor. Computational Materials Science, 2018, 154: 346–354
[41]
Zhang S, Guo B, Dong A, He J, Xu Z, Chen S X. Cautionary tales on air-quality improvement in Beijing. Proceedings of the Royal Society A: Mathematical, Physical and Engineering Sciences, 2017, 473(2205): 20170457
[42]
Harrison D Jr, Rubinfeld D L. Hedonic housing prices and the demand for clean air. Journal of Environmental Economics and Management, 1978, 5(1): 81–102
[43]
Buitinck L, Louppe G, Blondel M, Pedregosa F, Müller A, Grisel O, Niculae V, Prettenhofer P, Gramfort A, Grobler J, Layton R, VanderPlas J, Joly A, Holt B, Varoquaux G. API design for machine learning software: experiences from the scikit-learn project. 2013, arXiv preprint arXiv: 1309.0238
[44]
Asuncion A, Newman D J. UCI machine learning repository. Irvine: University of California, Irvine, 2017
[45]
Hazan E, Singh K. Boosting for online convex optimization. In: Proceedings of the 38th International Conference on Machine Learning. 2021, 4140–4149
[46]
Couellan N. Probabilistic robustness estimates for feed-forward neural networks. Neural Networks, 2021, 142: 138–147
[47]
Konstantinov A V, Utkin L V. Interpretable machine learning with an ensemble of gradient boosting machines. Knowledge-Based Systems, 2021, 222: 106993
[48]
Xing Y F, Xu Y H, Shi M H, Lian Y X. The impact of PM2.5 on the human respiratory system. Journal of Thoracic Disease, 2016, 8(1): E69–E74
[49]
Oune N, Bostanabad R. Latent map Gaussian processes for mixed variable metamodeling. Computer Methods in Applied Mechanics and Engineering, 2021, 387: 114128
[50]
Bekkar A, Hssina B, Douzi S, Douzi K. Air-pollution prediction in smart city, deep learning approach. Journal of Big Data, 2021, 8(1): 161

Acknowledgements

This work was supported by the National Natural Science Foundation of China (Grant No. 12071166), the Fundamental Research Funds for the Central Universities of China (Nos. 2662023LXPY005, 2662022XXYJ005), and HZAU-AGIS Cooperation Fund (No. SZYJY2023010).

Competing interests

The authors declare that they have no competing interests or financial conflicts to disclose.
Appendix A
In this section, we introduce the experimental environment and model architecture.
The Lasso experiments are carried out with scikit-learn 1.0.1. The LassoNet package came primarily out of research in Rob Tibshirani's lab at Stanford University. The SpAM package is available on GitHub. SPLAM is implemented with R 3.6.3, and the R package for SPLAT is available from Ashley Petersen's website. All neural-network experiments (including NAM, SNAM, FCNN, FCNN(l1), SPINN, and NPLAM) are carried out with Python 3.7, TensorFlow 2.2.0, NumPy 1.18.5, and scikit-learn 1.0.1 on an Intel Core i7-10700 (2.90 GHz) CPU and an NVIDIA GeForce GTX 1660 SUPER GPU, with 32 GB of memory. The NPLAM code will be made publicly available on GitHub.
For the 30-dimension dataset, the default architecture of each NPLAM sub-network consists of three dense layers with 1, 50, and 1 nodes. The default architecture of each NAM (SNAM) sub-network likewise consists of three dense layers with 1, 50, and 1 nodes. The default architecture of FCNN (FCNN(l1)) consists of three dense layers with 25, 25, and 1 nodes, that of LassoNet consists of two dense layers with 50 and 1 nodes, and that of SPINN consists of three dense layers with 30, 50, and 1 nodes. For hyperparameter selection, we use a coarse-to-fine grid search; Table A1 shows the initial grid ranges of the above models.

Table A1 Search range of initial coarse grids for different methods

Methods | Parameter | Search range of initial coarse grids
Lasso | Coefficient of the lasso penalty term | {10^{-5}, 10^{-4}, 10^{-3}, 10^{-2}, 10^{-1}, 10^{0}}
NAM | Learning rate | {10^{-4}, 10^{-3}, 10^{-2}, 10^{-1}, 10^{0}}
SNAM | Coefficient of the l1 penalty term for the last hidden layer weight | {10^{-4}, 10^{-3}, 10^{-2}, 10^{-1}, 10^{0}, 10^{1}}
SNAM | Learning rate | {10^{-4}, 10^{-3}, 10^{-2}, 10^{-1}, 10^{0}}
FCNN | Learning rate | {10^{-4}, 10^{-3}, 10^{-2}, 10^{-1}, 10^{0}}
FCNN(l1) | Coefficient of the l1 penalty term for the first hidden layer weight | {10^{-4}, 10^{-3}, 10^{-2}, 10^{-1}, 10^{0}, 10^{1}}
FCNN(l1) | Learning rate | {10^{-5}, 10^{-4}, 10^{-3}, 10^{-2}, 10^{-1}, 10^{0}}
SPINN | Coefficient of the sparse group lasso penalty term | {10^{-4}, 10^{-3}, 10^{-2}, 10^{-1}, 10^{0}, 10^{1}}
SPINN | Learning rate | {10^{-4}, 10^{-3}, 10^{-2}, 10^{-1}, 10^{0}}
SPLAM | Coefficient of the SPLAM penalty term | {10^{-5}, 10^{-4}, 10^{-3}, 10^{-2}, 10^{-1}, 10^{0}}
SPLAT | Alpha, which controls the strength of the linear fit | {0.1, 0.3, 0.5, 0.7, 0.9}
SPLAT | Number of lambda values | {5, 10, 15, 20, 25, 30}
NPLAM (ours) | Coefficient of the l1 penalty term for the feature-selection gates g1 | {10^{-4}, 10^{-3}, 10^{-2}, 10^{-1}, 10^{0}, 10^{1}}
NPLAM (ours) | Coefficient of the l1 penalty term for the structure-discovery gates g2 | {10^{-4}, 10^{-3}, 10^{-2}, 10^{-1}, 10^{0}, 10^{1}}
NPLAM (ours) | Coefficient of the l1 penalty term for the neural network weights | {10^{-4}, 10^{-3}, 10^{-2}, 10^{-1}, 10^{0}, 10^{1}}
NPLAM (ours) | Learning rate | {10^{-4}, 10^{-3}, 10^{-2}, 10^{-1}, 10^{0}}
Before training, we normalize the training and validation sets. During training, we use the Adam optimizer for the neural network models (the gradient at zero is set to 0). The model with the best performance on the validation set is saved as the prediction model. After each training run, we evaluate every model on the test set.
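As an illustration, the following Python/TensorFlow snippet is a minimal sketch (not the authors' released code) of the default per-feature sub-network described above (three dense layers with 1, 50, and 1 nodes) together with the training setup: normalized inputs, the Adam optimizer, and the model with the best validation loss kept as the prediction model. The toy data, learning rate, and number of epochs are our own choices.

```python
import numpy as np
import tensorflow as tf
from sklearn.preprocessing import StandardScaler

def build_subnetwork():
    # Default sub-network: dense layers with 1, 50, and 1 nodes.
    return tf.keras.Sequential([
        tf.keras.Input(shape=(1,)),
        tf.keras.layers.Dense(1, activation="relu"),
        tf.keras.layers.Dense(50, activation="relu"),
        tf.keras.layers.Dense(1),
    ])

# Toy one-dimensional data, only to make the sketch runnable.
rng = np.random.default_rng(0)
x = rng.uniform(-2.5, 2.5, size=(1000, 1))
y = 2.0 * np.sin(2.0 * x) + 0.1 * rng.normal(size=(1000, 1))

x = StandardScaler().fit_transform(x)  # normalize inputs before training

model = build_subnetwork()
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3), loss="mse")
best = tf.keras.callbacks.ModelCheckpoint(
    "best_subnetwork.h5", monitor="val_loss", save_best_only=True)
model.fit(x, y, validation_split=0.2, epochs=50, batch_size=64,
          callbacks=[best], verbose=0)
```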
Appendix B
In this section, we introduce the evaluation metrics.
For evaluating the approximation ability, we report the mean squared error (MSE) and its standard deviation (STD). The MSE is
\mathrm{MSE} = \frac{1}{m}\sum_{i=1}^{m} (f(x_i) - y_i)^2,
where m is the number of samples, f(x_i) is the predicted value of the i-th sample, and y_i is the true value of the i-th sample.
We then adopt the same evaluation metrics as in [35] for feature selection and structure discovery. Table A2 shows the confusion matrix. For feature selection, we report five metrics: TP (average number of relevant features correctly selected), FP (average number of irrelevant features incorrectly selected), TN (average number of irrelevant features correctly excluded), FN (average number of relevant features incorrectly excluded), and F1 (the harmonic mean of precision and recall on the positive features):

Table A2 Confusion matrix

                   | Actual positive | Actual negative
Predicted positive | TP              | FP
Predicted negative | FN              | TN

F_1 = \frac{2TP}{2TP + FP + FN}.
For structure discovery, we report three metrics: CF (correct-fitting), UF (under-fitting, i.e., a nonlinear feature misidentified as linear), and OF (over-fitting, i.e., a linear feature misidentified as nonlinear).
To display the results of feature selection and structure discovery, we use three symbols: √ indicates that an input feature has a significant impact on the response, while L (linear) and N (nonlinear) indicate that an input feature has a linear or a nonlinear impact on the response, respectively.
For a quantitative comparison of model approximation ability, we introduce the coefficient of determination (R^2). Let y_i and f_i denote the true and predicted values, respectively, and let m be the number of samples. The mean of the true values is \bar{y} = \frac{1}{m}\sum_{i=1}^{m} y_i. The data can then be measured with two sums of squares: the residual sum of squares SSR = \sum_i (y_i - f_i)^2 and the total sum of squares SST = \sum_i (y_i - \bar{y})^2. This yields the definition of R^2:
R^2 = 1 - \frac{SSR}{SST}.
Finally, we use FLOPs (floating-point operations) to measure the number of operations performed by the deep learning network.
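For concreteness, the following Python snippet is a small sketch of these metrics in NumPy (the function names are ours, not from the paper's code):

```python
import numpy as np

def mse(y_true, y_pred):
    # Mean squared error over m samples.
    return np.mean((y_pred - y_true) ** 2)

def f1_from_counts(tp, fp, fn):
    # F1 = 2TP / (2TP + FP + FN).
    return 2.0 * tp / (2.0 * tp + fp + fn)

def r_squared(y_true, y_pred):
    # R^2 = 1 - SSR / SST.
    ssr = np.sum((y_true - y_pred) ** 2)           # residual sum of squares
    sst = np.sum((y_true - np.mean(y_true)) ** 2)  # total sum of squares
    return 1.0 - ssr / sst

y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([1.1, 1.9, 3.2, 3.8])
print(mse(y_true, y_pred), r_squared(y_true, y_pred), f1_from_counts(5, 0, 0))
```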

Table A3 Detailed results of simulation

30-dimension simulation: feature selection and structure discovery (columns: x1(N), x2(N), x3(N), x4(L), x5(L), Other(-))
Methods for feature selection:
Lasso | -
NAM | -
SNAM | -
FCNN | -
FCNN(l1) | - -
LassoNet | - -
SpAM | -
SPINN | - - -
Methods for feature selection & structure discovery:
SPLAM | √(N) √(L) - √(L) √(L)
SPLAT | √(N) √(L) - √(L) √(L)
NPLAM | √(N) √(N) √(N) √(L) √(L) -

30-dimension simulation: quantitative results
Methods | TP | TN | FP | FN | F1 | CF | UF | OF | MSE (STD)
Methods for feature selection:
Lasso | 4.1 | 23.3 | 1.7 | 0.9 | 0.759 | - | - | - | 0.0394 (±0.0022)
NAM | 3.7 | 15.5 | 9.5 | 1.3 | 0.407 | - | - | - | 0.0122 (±0.0017)
SNAM | 5.0 | 25.0 | 0.0 | 0.0 | 1.000 | - | - | - | 0.0037 (±0.0009)
FCNN | 3.3 | 24.3 | 0.7 | 1.7 | 0.733 | - | - | - | 0.0360 (±0.0029)
FCNN(l1) | 3.4 | 24.2 | 0.8 | 1.6 | 0.739 | - | - | - | 0.0297 (±0.0032)
LassoNet | 3.8 | 24.7 | 0.3 | 1.2 | 0.835 | - | - | - | 0.0327 (±0.0024)
SpAM | 5.0 | 25.0 | 0.0 | 0.0 | 1.000 | - | - | - | 0.0226 (±0.0039)
SPINN | 3.5 | 24.8 | 0.2 | 1.5 | 0.805 | - | - | - | 0.0294 (±0.0015)
Methods for feature selection & structure discovery:
SPLAM | 3.1 | 25.0 | 0.0 | 1.9 | 0.765 | 0.900 | 0.100 | 0.000 | 0.1494 (±0.0133)
SPLAT | 4.3 | 25.0 | 0.0 | 0.7 | 0.925 | 0.943 | 0.057 | 0.000 | 0.0178 (±0.0040)
NPLAM | 5.0 | 25.0 | 0.0 | 0.0 | 1.000 | 1.000 | 0.000 | 0.000 | 0.0025 (±0.0002)

300-dimension simulation: feature selection and structure discovery (columns: x1(N), x2(N), x3(N), x4(L), x5(L), Other(-))
Methods for feature selection:
Lasso | - -
NAM |
FCNN | -
FCNN(l1) | - - -
LassoNet | -
SpAM | -
SPINN | - - -
Methods for feature selection & structure discovery:
SPLAM | √(N) √(L) - √(L) √(L)
SPLAT | √(L) √(L) - √(L) √(L) -
NPLAM | √(N) √(N) √(N) √(L) √(L) -

300-dimension simulation: quantitative results
Methods | TP | TN | FP | FN | F1 | CF | UF | OF | MSE (STD)
Methods for feature selection:
Lasso | 4.0 | 295.0 | 0.0 | 1.0 | 0.889 | - | - | - | 0.0290 (±0.0014)
NAM | 4.6 | 214.5 | 80.5 | 0.4 | 0.102 | - | - | - | 0.0057 (±0.0012)
SNAM | 5.0 | 295.0 | 0.0 | 0.0 | 1.000 | - | - | - | 0.0032 (±0.0002)
FCNN | 3.9 | 294.5 | 0.5 | 1.1 | 0.830 | - | - | - | 0.0180 (±0.0019)
FCNN(l1) | 2.8 | 294.3 | 0.7 | 2.2 | 0.659 | - | - | - | 0.0167 (±0.0025)
LassoNet | 3.4 | 289.5 | 5.5 | 1.6 | 0.489 | - | - | - | 0.0268 (±0.0011)
SpAM | 5.0 | 295.0 | 0.0 | 0.0 | 1.000 | - | - | - | 0.0149 (±0.0023)
SPINN | 1.9 | 294.1 | 0.9 | 3.1 | 0.487 | - | - | - | 0.0357 (±0.0024)
Methods for feature selection & structure discovery:
SPLAM | 4.1 | 294.9 | 0.1 | 0.9 | 0.756 | 0.990 | 0.007 | 0.003 | 0.0978 (±0.0555)
SPLAT | 3.2 | 295.0 | 0.0 | 0.8 | 0.889 | 0.990 | 0.010 | 0.000 | 0.0245 (±0.0007)
NPLAM | 5.0 | 295.0 | 0.0 | 0.0 | 1.000 | 1.000 | 0.000 | 0.000 | 0.0024 (±0.0002)
Appendix C
We compare different methods on the 30-dimension and 300-dimension simulation datasets and present the results in Table A3. The top halves of the table show how each method judges the important features and their corresponding structures, and the bottom halves present the quantitative results of each method.
Appendix D
For the third ablation experiment, we generate three datasets: a fully linear feature dataset, a fully nonlinear feature dataset, and a partially linear feature dataset. The fully linear feature dataset is generated from 30 features, and the constructor is
y = 2x_1 + 3x_2 - 2.5x_3 - 3.5x_4 + 4x_5 + \epsilon,
where \epsilon \sim N(0,1) and all features are generated uniformly in [-2.5, 2.5]. This dataset has only 5 valid features, all of which are linear. The fully nonlinear feature dataset is generated from 30 features, and the constructor is
y = 2\sin(2x_1) + x_2^2 + e^{x_3} + 3\cos(3x_4) - 2x_5^2 + \epsilon,
where \epsilon \sim N(0,1) and all features are generated uniformly in [-2.5, 2.5]. Similar to the above dataset, this dataset has only 5 valid features, all of which are nonlinear. The partially linear feature dataset is generated according to
y = 2x_1 + 3x_2 - 2.5x_3 - 3.5x_4 + 4x_5 - 2x_6 - 3x_7 + 2.5x_8 + 3.5x_9 - 4x_{10} + 1.5x_{11} + 2.5x_{12} - 2x_{13} - 3x_{14} + 3.5x_{15} + 2\sin(2x_{16}) + x_{17}^2 - \exp(x_{18}) + 3\cos(3x_{19}) - 2x_{20}^2 - 2\sin(2x_{21}) - x_{22}^2 - \exp(x_{23}) - 3\cos(3x_{24}) + 2x_{25}^2 + 3\sin(2x_{26}) + 1.5x_{27}^2 + 1.5\exp(x_{28}) + 4\cos(2x_{29}) + x_{30}^2 + \epsilon,
where \epsilon \sim N(0,1) and all features are generated uniformly in [-2.5, 2.5]. The first 15 features have a linear structure and the last 15 features have a nonlinear structure.
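As an example, the fully linear dataset described above could be generated with NumPy as follows (the sample size and random seed are our own choices):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 1000, 30
X = rng.uniform(-2.5, 2.5, size=(n, d))   # 30 features, uniform on [-2.5, 2.5]
eps = rng.normal(0.0, 1.0, size=n)        # Gaussian noise, N(0, 1)
# Only the first 5 features enter the response, all linearly.
y = 2 * X[:, 0] + 3 * X[:, 1] - 2.5 * X[:, 2] - 3.5 * X[:, 3] + 4 * X[:, 4] + eps
```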
Appendix E
In this section, we introduce the attribute information of the real-world datasets.
California Housing [39] dataset contains 20,640 observations with 9 numerical features, including:
● LON : The longitude of the house;
● LAT : The latitude of the house;
● HMA : The median age of the house; a lower number indicates a newer house;
● TR : Total number of rooms;
● TB : Total number of bedrooms;
● POP : Total number of people residing;
● HH : Total number of households;
● MI : Median income for households of houses (measured in US Dollars);
● median price of houses: Median house value for households (measured in US Dollars).
Super-Conductivity [40] dataset comes from Japan's National Institute for Materials Science and contains 82 features extracted from 21,263 superconductors. In this paper, we use shorthand for some of the features, including:
● RAR : range atomic radius;
● SAR : std atomic radius;
● WET : wtd entropy ThermalConductivity;
● WMV : wtd mean Valence;
● WGV : wtd std Valence;
where std denotes standard deviation and wtd denotes weighted.
Beijing Air Quality [41] dataset comes from Beijing, China, which has six main air pollutants and six relevant meteorological features, including:
● PM2.5 : PM2.5 concentration (μg/m3);
● PM10 : PM10 concentration (μg/m3);
● SO2 : SO2 concentration (μg/m3);
● NO2 : NO2 concentration (μg/m3);
● CO : CO concentration (μg/m3);
● O3 : O3 concentration (μg/m3);
● TEM: temperature (degree Celsius);
● PRE: pressure (hPa);
● DEW: dew point temperature (degree Celsius);
● RA: precipitation (mm);
● WSPM: wind speed (m/s);
● WD: wind direction.
Boston Housing [42] dataset comes from the 506 census tracts of Boston from the 1970 census and contains 14 variables, including:
● CRIM : Per capita crime rate by town;
● ZN : Proportion of residential land zoned for lots over 25,000 sq. ft;
● INDUS : Proportion of non-retail business acres per town;
● CHAS : Charles River dummy variable;
● NOX : Nitric oxides concentration;
● RM : Average number of rooms per dwelling;
● AGE : Proportion of owner-occupied units built prior to 1940;
● DIS : Weighted distances to five Boston employment centres;
● RAD : Index of accessibility to radial highways;
● TAX : Full-value property-tax rate per $10,000;
● PTRATIO : Pupil-teacher ratio by town;
● B : 1000(Bk - 0.63)^2, where Bk is the proportion of Black residents by town;
● LSTAT : Lower status of the population;
● MEDV : Median value of owner-occupied homes in $1000's.

RIGHTS & PERMISSIONS

2024 Higher Education Press