Variable Selection for Distributed Sparse Regression Under Memory Constraints

Haofeng Wang , Xuejun Jiang , Min Zhou , Jiancheng Jiang

Communications in Mathematics and Statistics, 2023, Vol. 12, Issue 2: 307–338. DOI: 10.1007/s40304-022-00291-w
Abstract

This paper studies variable selection using the penalized likelihood method for distributed sparse regression with large sample size n under a limited memory constraint, a much-needed problem in the big data era. A naive divide-and-conquer approach splits the whole data set into N parts, runs each part on one of N machines, aggregates the results from all machines via simple averaging, and finally obtains the selected variables. However, this approach tends to select more noise variables, and the false discovery rate may not be well controlled. We improve it with a specially designed weighted average in the aggregation step. Although the alternating direction method of multipliers has been used to handle massive data in the literature, our proposed method greatly reduces the computational burden and performs better in terms of mean squared error in most cases. Theoretically, we establish asymptotic properties of the resulting estimators for likelihood models with a diverging number of parameters. Under some regularity conditions, we establish oracle properties in the sense that our distributed estimator shares the same asymptotic efficiency as the estimator based on the full sample. Computationally, a distributed penalized likelihood algorithm is proposed to refine the results in the context of general likelihoods. The proposed method is further evaluated by simulations and a real example.
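The divide-and-conquer scheme described above can be sketched in a few lines. The following is a minimal illustration, not the paper's method: it fits a lasso on each of N data blocks via ISTA (iterative soft-thresholding) and compares simple averaging with a weighted average. The inverse-residual-variance weights used here are a hypothetical stand-in for the paper's specially designed weights.

```python
import numpy as np

def lasso_ista(X, y, lam, n_iter=500):
    """Lasso via ISTA: minimize (1/(2n))||y - Xb||^2 + lam * ||b||_1."""
    n, p = X.shape
    L = np.linalg.norm(X, 2) ** 2 / n      # Lipschitz constant of the gradient
    b = np.zeros(p)
    for _ in range(n_iter):
        grad = X.T @ (X @ b - y) / n       # gradient of the smooth part
        z = b - grad / L
        b = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)  # soft-threshold
    return b

rng = np.random.default_rng(0)
n, p, N = 2000, 10, 4                      # N machines, n/N observations each
beta = np.zeros(p)
beta[:3] = [2.0, -1.5, 1.0]                # sparse truth
X = rng.standard_normal((n, p))
y = X @ beta + rng.standard_normal(n)

# Fit the lasso separately on each of the N blocks (one per "machine").
parts = [lasso_ista(X[i::N], y[i::N], lam=0.1) for i in range(N)]

# Naive divide-and-conquer: simple average of the local estimates.
b_avg = np.mean(parts, axis=0)

# Weighted aggregation (illustrative weights, NOT the paper's construction):
# weight each local estimate by the inverse of its residual variance.
w = np.array([1.0 / np.var(y[i::N] - X[i::N] @ parts[i]) for i in range(N)])
b_wavg = (w[:, None] * np.array(parts)).sum(axis=0) / w.sum()
```

With homogeneous blocks the two aggregates are close; the paper's point is that a well-chosen weighting controls false discoveries better than the plain average while remaining a one-shot, communication-light procedure.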

Keywords

Variable selection / Distributed sparse regression / Memory constraints / Distributed penalized likelihood algorithm

Cite this article

Haofeng Wang, Xuejun Jiang, Min Zhou, Jiancheng Jiang. Variable Selection for Distributed Sparse Regression Under Memory Constraints. Communications in Mathematics and Statistics, 2023, 12(2): 307–338. DOI: 10.1007/s40304-022-00291-w



Funding

National Natural Science Foundation of China (11871263)

Natural Science Foundation of Guangdong Province (2017A030313012)
