Abstract
This paper studies feature screening and error variance estimation in ultrahigh-dimensional linear models with errors-in-variables (EV). Because the sure independence screening (SIS) method, which relies on marginal Pearson correlation learning, may omit important variables when the covariates are observed with measurement error, we propose a corrected SIS procedure, called EVSIS, that ranks features by their corrected marginal correlation with the response variable. We also propose a corrected error variance estimation procedure that greatly attenuates the influence of both measurement errors and spurious correlations. Under some regularity conditions, EVSIS possesses the sure screening property and ranking consistency, and the corrected error variance estimator is proved to be asymptotically normal. Simulations and a real data example illustrate the two methods and suggest that they perform well.
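The abstract does not state the form of the corrected marginal correlation, so the following is only a minimal sketch of one plausible reading, not the authors' definition of EVSIS. It assumes the classical additive model W = X + U, with measurement error U independent of X and of the response, and with known error variances. Under those assumptions the covariance between W_j and y equals the covariance between X_j and y, so only the variance term needs correcting. The names `corrected_marginal_correlation`, `screen_top_d`, and `sigma_u2` are hypothetical.

```python
import numpy as np


def corrected_marginal_correlation(W, y, sigma_u2):
    """Attenuation-corrected marginal correlation of each column of W with y.

    Assumption (not from the paper): W = X + U, U independent of (X, y),
    and Var(U_j) = sigma_u2[j] is known.
    """
    W = np.asarray(W, dtype=float)
    y = np.asarray(y, dtype=float)
    n = W.shape[0]
    Wc = W - W.mean(axis=0)
    yc = y - y.mean()
    # Cov(W_j, y) estimates Cov(X_j, y) because U is independent of y.
    cov_wy = Wc.T @ yc / (n - 1)
    # Var(W_j) = Var(X_j) + Var(U_j), so subtract the known error variance.
    var_w = Wc.var(axis=0, ddof=1)
    var_x = np.maximum(var_w - np.asarray(sigma_u2, dtype=float), 1e-12)
    return cov_wy / np.sqrt(var_x * yc.var(ddof=1))


def screen_top_d(W, y, sigma_u2, d):
    """Indices of the d features with the largest absolute corrected correlation."""
    omega = np.abs(corrected_marginal_correlation(W, y, sigma_u2))
    return np.argsort(-omega)[:d]
```

When `sigma_u2` is zero for every feature, the statistic reduces to the ordinary Pearson correlation used by SIS. When the error variances differ across features, the correction can change the ranking, which is the situation the abstract describes.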
Keywords
Ultrahigh-dimensional linear model / Measurement error / Feature screening / Error variance estimation / Sure screening property / Asymptotic properties
Cite this article
Hengjian Cui, Feng Zou, Li Ling.
Feature Screening and Error Variance Estimation for Ultrahigh-Dimensional Linear Model with Measurement Errors.
Communications in Mathematics and Statistics, 1-33. DOI: 10.1007/s40304-022-00317-3
Funding
National Natural Science Foundation of China (No. 11971324)
The State Key Program of National Natural Science Foundation of China (No. 12031016)