Neural partially linear additive model
Liangxuan ZHU, Han LI, Xuelin ZHANG, Lingjuan WU, Hong CHEN
Interpretability has drawn increasing attention in machine learning. Most existing works focus on post-hoc explanations rather than building self-explaining models. To address this, we propose a Neural Partially Linear Additive Model (NPLAM), which automatically distinguishes insignificant, linear, and nonlinear features within a neural network. On the one hand, the neural-network construction fits data better than spline functions with the same number of parameters; on the other hand, the learnable gate design and the sparsity regularization term preserve the abilities of feature selection and structure discovery. We theoretically establish generalization error bounds for the proposed method via Rademacher complexity. Experiments on both simulated and real-world datasets verify its good performance and interpretability.
Keywords: feature selection / structure discovery / partially linear additive model / neural network
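The gist of the model described in the abstract can be illustrated with a short sketch. The code below is a minimal, hypothetical PyTorch rendition (not the authors' implementation): each feature gets a linear term and a small subnetwork, a feature-selection gate decides whether the feature is used at all, a structure-discovery gate blends its linear and nonlinear parts, and a sparsity penalty on both gate types encourages pruning.

```python
# Minimal, illustrative sketch of a gated partially linear additive network in PyTorch;
# names, gate parameterization, and penalty are assumptions, not the authors' released code.
import torch
import torch.nn as nn

class GatedPLANet(nn.Module):
    def __init__(self, num_features: int, hidden: int = 16):
        super().__init__()
        # One small subnetwork per feature models its nonlinear effect.
        self.subnets = nn.ModuleList([
            nn.Sequential(nn.Linear(1, hidden), nn.ReLU(), nn.Linear(hidden, 1))
            for _ in range(num_features)
        ])
        self.linear = nn.Parameter(torch.zeros(num_features))   # per-feature linear coefficients
        self.s_logit = nn.Parameter(torch.zeros(num_features))  # feature-selection gate logits
        self.t_logit = nn.Parameter(torch.zeros(num_features))  # structure-discovery gate logits
        self.bias = nn.Parameter(torch.zeros(1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        s = torch.sigmoid(self.s_logit)  # in (0, 1): keep or drop each feature
        t = torch.sigmoid(self.t_logit)  # in (0, 1): nonlinear vs. linear effect
        lin = x * self.linear
        nonlin = torch.cat(
            [net(x[:, j:j + 1]) for j, net in enumerate(self.subnets)], dim=1
        )
        # Blend linear and nonlinear parts per feature, then gate the feature itself.
        contrib = s * ((1.0 - t) * lin + t * nonlin)
        return contrib.sum(dim=1) + self.bias

    def gate_penalty(self) -> torch.Tensor:
        # Sparsity term pushing both kinds of gates toward zero; sigmoid outputs are
        # non-negative, so summing them acts like an L1 penalty on the gate values.
        return torch.sigmoid(self.s_logit).sum() + torch.sigmoid(self.t_logit).sum()
```

Training would minimize a fitting loss plus a weighted gate_penalty (and possibly a weight penalty); after training, an s gate near zero marks an insignificant feature, while a kept feature with a small t gate is read as linear and one with a large t gate as nonlinear.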
Liangxuan Zhu received his BS degree from the North China Institute of Science and Technology, China, in 2019. He is currently pursuing an MS degree at the College of Informatics, Huazhong Agricultural University, China. His current research interests lie in machine learning, including deep learning, learning theory, and interpretability.
Han Li received her BS degree in Mathematics and Applied Mathematics from the Faculty of Mathematics and Computer Science, Hubei University, China, in 2007, and her PhD degree from the School of Mathematics and Statistics, Beihang University, China. She worked as a project assistant professor in the Department of Mechanical Engineering, Kyushu University, Japan. She is now an associate professor in the College of Informatics, Huazhong Agricultural University, China. Her research interests include neural networks, learning theory, and pattern recognition.
Xuelin Zhang received his BE degree from China Agricultural University, China, in 2019. He is currently a PhD student with the College of Science, Huazhong Agricultural University, China. His current research interests include robust machine learning and statistical learning theory.
Lingjuan Wu received her PhD degree in Microelectronics and Solid State Electronics from Peking University, China, in 2013. She visited the University of California, San Diego, USA, as a research scholar from 2010 to 2012. She is currently an associate professor with the College of Informatics, Huazhong Agricultural University, China. Her research interests are in machine learning and hardware security, including learning theory, machine learning-based side-channel analysis, and hardware Trojan detection.
Hong Chen received his BS and PhD degrees from Hubei University, China, in 2003 and 2009, respectively. He worked as a postdoctoral researcher at the University of Texas, USA, during 2016–2017. He is currently a professor with the College of Informatics, Huazhong Agricultural University, China. His current research interests include machine learning, statistical learning theory, and approximation theory.
Table A1 Search range of initial coarse grids for different methods

| Methods | Regularization* | Parameters | Search range of initial coarse grids |
| --- | --- | --- | --- |
| Lasso | √ | Coefficient of lasso penalty term. | |
| NAM | - | Learning rate. | |
| SNAM | √ | Coefficient of penalty term for last hidden layer weight. | |
| | | Learning rate. | |
| FCNN | - | Learning rate. | |
| FCNN() | √ | Coefficient of penalty term for first hidden layer weight. | |
| | | Learning rate. | |
| SPINN | √ | Coefficient of sparse group lasso penalty term. | |
| | | Learning rate. | |
| SPLAM | √ | Coefficient of SPLAM penalty term. | |
| SPLAT | √ | Alpha which controls the strength of the linear fit. | |
| | | Number of lambda values. | |
| NPLAM (ours) | √ | Coefficient of penalty term for the feature selection gates. | |
| | | Coefficient of penalty term for the structure discovery gates. | |
| | | Coefficient of penalty term for the weight of the neural network. | |
| | | Learning rate. | |
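Table A1 lists only which hyperparameters each method tunes; the numeric search ranges did not survive extraction and are left blank. Purely as an illustration of how such a coarse grid could be swept, the sketch below assumes hypothetical ranges and a user-supplied train_and_score helper; neither reflects the paper's actual settings.

```python
import itertools

# Illustrative coarse grids; the actual ranges used in the paper are not reproduced here.
coarse_grid = {
    "gate_penalty_feature": [1e-4, 1e-3, 1e-2],    # feature-selection gate penalty (assumed)
    "gate_penalty_structure": [1e-4, 1e-3, 1e-2],  # structure-discovery gate penalty (assumed)
    "weight_penalty": [1e-5, 1e-4, 1e-3],          # network weight penalty (assumed)
    "learning_rate": [1e-3, 1e-2],                 # optimizer step size (assumed)
}

def coarse_search(train_and_score):
    """Return the best combination according to a user-supplied
    train_and_score(params) -> validation_error callable (hypothetical helper)."""
    best_params, best_err = None, float("inf")
    keys = list(coarse_grid)
    for values in itertools.product(*(coarse_grid[k] for k in keys)):
        params = dict(zip(keys, values))
        err = train_and_score(params)
        if err < best_err:
            best_params, best_err = params, err
    return best_params, best_err
```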
Table A2 Confusion matrix

| | Actual positive | Actual negative |
| --- | --- | --- |
| Predicted positive | TP | FP |
| Predicted negative | FN | TN |
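Table A3 below reports F1 scores derived from these confusion-matrix counts. The following is a minimal sketch of that standard computation (precision, recall, and F1 from TP, FP, FN); it is the textbook definition, not code from the paper.

```python
def f1_score(tp: float, fp: float, fn: float) -> float:
    """F1 = harmonic mean of precision and recall, from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Example with the averaged counts reported for Lasso in the 30-dimension simulation:
# f1_score(4.1, 1.7, 0.9) ≈ 0.759, matching the F1 column in Table A3.
```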
Table A3 Detailed results of simulation

30-dimension simulation: selected features and identified structure

| | Methods | (N) | (N) | (N) | (L) | (L) | Other (-) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Methods for feature selection | Lasso | √ | √ | - | √ | √ | √ |
| | NAM | √ | √ | √ | √ | - | √ |
| | SNAM | √ | √ | √ | √ | √ | - |
| | FCNN | - | √ | √ | √ | √ | √ |
| | FCNN() | - | √ | √ | - | √ | √ |
| | LassoNet | - | √ | √ | √ | √ | - |
| | SpAM | √ | √ | √ | √ | √ | - |
| | SPINN | - | √ | - | √ | √ | - |
| Methods for feature selection & structure discovery | SPLAM | √(N) | √(L) | - | √(L) | √(L) | √ |
| | SPLAT | √(N) | √(L) | - | √(L) | √(L) | √ |
| | NPLAM | √(N) | √(N) | √(N) | √(L) | √(L) | - |

30-dimension simulation: quantitative results

| | Methods | TP | TN | FP | FN | F1 | CF | UF | OF | MSE (STD) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Methods for feature selection | Lasso | 4.1 | 23.3 | 1.7 | 0.9 | 0.759 | - | - | - | 0.0394 (±0.0022) |
| | NAM | 3.7 | 15.5 | 9.5 | 1.3 | 0.407 | - | - | - | 0.0122 (±0.0017) |
| | SNAM | 5.0 | 25.0 | 0.0 | 0.0 | 1.000 | - | - | - | 0.0037 (±0.0009) |
| | FCNN | 3.3 | 24.3 | 0.7 | 1.7 | 0.733 | - | - | - | 0.0360 (±0.0029) |
| | FCNN() | 3.4 | 24.2 | 0.8 | 1.6 | 0.739 | - | - | - | 0.0297 (±0.0032) |
| | LassoNet | 3.8 | 24.7 | 0.3 | 1.2 | 0.835 | - | - | - | 0.0327 (±0.0024) |
| | SpAM | 5.0 | 25.0 | 0.0 | 0.0 | 1.000 | - | - | - | 0.0226 (±0.0039) |
| | SPINN | 3.5 | 24.8 | 0.2 | 1.5 | 0.805 | - | - | - | 0.0294 (±0.0015) |
| Methods for feature selection & structure discovery | SPLAM | 3.1 | 25 | 0.0 | 1.9 | 0.765 | 0.900 | 0.100 | 0.000 | 0.1494 (±0.0133) |
| | SPLAT | 4.3 | 25 | 0 | 0.7 | 0.925 | 0.943 | 0.057 | 0.000 | 0.0178 (±0.0040) |
| | NPLAM | 5.0 | 25.0 | 0.0 | 0.0 | 1.000 | 1.000 | 0.000 | 0.000 | 0.0025 (±0.0002) |

300-dimension simulation: selected features and identified structure

| | Methods | (N) | (N) | (N) | (L) | (L) | Other (-) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Methods for feature selection | Lasso | √ | √ | - | √ | √ | - |
| | NAM | √ | √ | √ | √ | √ | √ |
| | FCNN | - | √ | √ | √ | √ | √ |
| | FCNN() | - | √ | √ | - | √ | - |
| | LassoNet | √ | √ | - | √ | √ | √ |
| | SpAM | √ | √ | √ | √ | √ | - |
| | SPINN | - | √ | - | √ | √ | - |
| Methods for feature selection & structure discovery | SPLAM | √(N) | √(L) | - | √(L) | √(L) | √ |
| | SPLAT | √(L) | √(L) | - | √(L) | √(L) | - |
| | NPLAM | √(N) | √(N) | √(N) | √(L) | √(L) | - |

300-dimension simulation: quantitative results

| | Methods | TP | TN | FP | FN | F1 | CF | UF | OF | MSE (STD) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Methods for feature selection | Lasso | 4.0 | 295.0 | 0.0 | 1.0 | 0.889 | - | - | - | 0.0290 (±0.0014) |
| | NAM | 4.6 | 214.5 | 80.5 | 0.4 | 0.102 | - | - | - | 0.0057 (±0.0012) |
| | SNAM | 5.0 | 295.0 | 0.0 | 0.0 | 1.000 | - | - | - | 0.0032 (±0.0002) |
| | FCNN | 3.9 | 294.5 | 0.5 | 1.1 | 0.830 | - | - | - | 0.0180 (±0.0019) |
| | FCNN() | 2.8 | 294.3 | 0.7 | 2.2 | 0.659 | - | - | - | 0.0167 (±0.0025) |
| | LassoNet | 3.4 | 289.5 | 5.5 | 1.6 | 0.489 | - | - | - | 0.0268 (±0.0011) |
| | SpAM | 5.0 | 295.0 | 0.0 | 0.0 | 1.000 | - | - | - | 0.0149 (±0.0023) |
| | SPINN | 1.9 | 294.1 | 0.9 | 3.1 | 0.487 | - | - | - | 0.0357 (±0.0024) |
| Methods for feature selection & structure discovery | SPLAM | 4.1 | 294.9 | 0.1 | 0.9 | 0.756 | 0.990 | 0.007 | 0.003 | 0.0978 (±0.0555) |
| | SPLAT | 3.2 | 295 | 0 | 0.8 | 0.889 | 0.990 | 0.010 | 0.000 | 0.0245 (±0.0007) |
| | NPLAM | 5.0 | 295.0 | 0.0 | 0.0 | 1.000 | 1.000 | 0.000 | 0.000 | 0.0024 (±0.0002) |