Abstract
Feature selection is a challenging problem for varying coefficient models when the dimensionality of the covariates is ultrahigh. The traditional approach to substantially reducing the dimensionality is marginal correlation screening based on nonparametric smoothing. However, marginal correlation screening methods may screen out variables that are jointly correlated with the response. To address this, we propose a novel screener, group screening via nonparametric smoothing high-dimensional ordinary least squares projection, referred to as “Group HOLP”, and study its sure screening property. Building on this property, we introduce a refined feature selection procedure that employs the extended Bayesian information criteria (EBIC) to select suitable submodels in varying coefficient models; we call this the Group HOLP-EBIC method. Under some regularity conditions, we establish the strong consistency of feature selection for the proposed method. The performance of our method is evaluated by simulations and further illustrated by two real examples.
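The two-stage idea described above (group screening by a least squares projection, then submodel comparison by EBIC) can be illustrated with a minimal sketch. It is not the authors' implementation, and several pieces are assumptions made only for illustration: a cubic polynomial basis in the index variable `u` stands in for the nonparametric smoothing basis, a tiny ridge term is added for numerical stability, each covariate is scored by the Euclidean norm of its coefficient group, and the EBIC is written in one common Gaussian form with `gamma` as its tuning constant. The function names `basis`, `group_holp_scores`, and `ebic` are hypothetical.

```python
import numpy as np
from math import comb, log


def basis(u, degree=3):
    """Polynomial basis in u; an assumed stand-in for a spline basis."""
    return np.vander(u, degree + 1, increasing=True)  # shape (n, L)


def group_holp_scores(X, u, y, degree=3, ridge=1e-8):
    """Score each covariate by the norm of its projected coefficient group."""
    n, p = X.shape
    B = basis(u, degree)
    L = B.shape[1]
    # Expanded design: columns j*L .. (j+1)*L - 1 hold x_j times each basis function.
    Z = (X[:, :, None] * B[:, None, :]).reshape(n, p * L)
    # Least squares projection Z^T (Z Z^T)^{-1} y, with a small ridge (assumption).
    G = Z @ Z.T + ridge * np.eye(n)
    theta = Z.T @ np.linalg.solve(G, y)
    return np.linalg.norm(theta.reshape(p, L), axis=1)


def ebic(X, u, y, subset, gamma=1.0, degree=3):
    """One common Gaussian form of EBIC for the submodel indexed by `subset`."""
    n, p = X.shape
    B = basis(u, degree)
    L = B.shape[1]
    k = len(subset)
    Xs = X[:, subset]
    Zs = (Xs[:, :, None] * B[:, None, :]).reshape(n, k * L)
    coef, *_ = np.linalg.lstsq(Zs, y, rcond=None)
    rss = float(np.sum((y - Zs @ coef) ** 2))
    # Fit term + BIC penalty on k*L parameters + extra penalty on model-space size.
    return n * log(rss / n) + k * L * log(n) + 2 * gamma * log(comb(p, k))


# Synthetic demonstration data (illustrative only, not from the article).
rng = np.random.default_rng(1)
n_demo, p_demo = 100, 200
X_demo = rng.standard_normal((n_demo, p_demo))
u_demo = rng.uniform(size=n_demo)
y_demo = ((2 + u_demo) * X_demo[:, 0] + 3 * u_demo**2 * X_demo[:, 1]
          + 0.1 * rng.standard_normal(n_demo))

scores = group_holp_scores(X_demo, u_demo, y_demo)
kept = np.argsort(scores)[::-1][:20]  # keep the 20 highest-scoring covariates
print("screened covariates:", sorted(kept.tolist()))
print("EBIC of {0, 1}:", ebic(X_demo, u_demo, y_demo, [0, 1]))
```

In this sketch the screening step only ranks covariates; the refined selection step would then compare EBIC values over candidate submodels built from the retained set and keep the one with the smallest value. The screening size of 20 is an arbitrary choice for the demonstration.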
Keywords
Varying coefficient models / Feature screening / Nonparametric smoothing / Extended Bayesian information criteria / High-dimensional ordinary least squares projection
Cite this article
Haofeng Wang, Hongxia Jin, Xuejun Jiang.
Feature Selection for High-Dimensional Varying Coefficient Models via Ordinary Least Squares Projection.
Communications in Mathematics and Statistics, 2023, 13(3): 607-648 DOI:10.1007/s40304-022-00326-2
Funding
National Natural Science Foundation of China (11871263)
Shenzhen Sci-Tech Fund (JCYJ20210324104803010)
RIGHTS & PERMISSIONS
School of Mathematical Sciences, University of Science and Technology of China and Springer-Verlag GmbH Germany, part of Springer Nature