Variable Selection for Distributed Sparse Regression Under Memory Constraints

Haofeng Wang, Xuejun Jiang, Min Zhou, Jiancheng Jiang

Communications in Mathematics and Statistics ›› 2023, Vol. 12 ›› Issue (2) : 307-338. DOI: 10.1007/s40304-022-00291-w
Article

Abstract

This paper studies variable selection using the penalized likelihood method for distributed sparse regression with a large sample size n under a limited memory constraint, a much-needed problem in the big data era. A naive divide-and-conquer approach splits the whole data set into N parts, runs each part on one of N machines, aggregates the results from all machines by averaging, and finally obtains the selected variables. However, this approach tends to select more noise variables, and the false discovery rate may not be well controlled. We improve it with a specially designed weighted average in the aggregation step. Although the alternating direction method of multipliers can be used to handle massive data, our proposed method greatly reduces the computational burden and performs better in terms of mean squared error in most cases. Theoretically, we establish asymptotic properties of the resulting estimators for likelihood models with a diverging number of parameters. Under some regularity conditions, we establish oracle properties in the sense that our distributed estimator shares the same asymptotic efficiency as the estimator based on the full sample. Computationally, a distributed penalized likelihood algorithm is proposed to refine the results in the context of general likelihoods. Furthermore, the proposed method is evaluated by simulations and a real example.
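The naive divide-and-conquer baseline described in the abstract can be sketched as follows. This is a minimal illustration only: the lasso-by-coordinate-descent local fit, the function names, and the generic user-supplied aggregation weights are assumptions for demonstration; the paper's specially designed weighted average and its distributed penalized likelihood algorithm are not reproduced here.

```python
import numpy as np

def soft_threshold(z, gamma):
    # Soft-thresholding operator, the building block of coordinate-descent lasso.
    return np.sign(z) * np.maximum(np.abs(z) - gamma, 0.0)

def lasso_cd(X, y, lam, n_iter=200):
    # Lasso by cyclic coordinate descent for (1/2n)||y - Xb||^2 + lam * ||b||_1.
    n, p = X.shape
    beta = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0)
    for _ in range(n_iter):
        for j in range(p):
            partial = y - X @ beta + X[:, j] * beta[j]  # residual excluding feature j
            beta[j] = soft_threshold(X[:, j] @ partial, n * lam) / col_sq[j]
    return beta

def distributed_lasso(X, y, n_machines, lam, weights=None):
    # Row-split the data across machines, fit a lasso on each chunk,
    # then aggregate the local coefficient vectors.
    Xs = np.array_split(X, n_machines)
    ys = np.array_split(y, n_machines)
    betas = np.stack([lasso_cd(Xk, yk, lam) for Xk, yk in zip(Xs, ys)])
    if weights is None:                  # naive divide-and-conquer: plain average
        return betas.mean(axis=0)
    w = np.asarray(weights, dtype=float)  # weighted aggregation (weights supplied by caller)
    return (w[:, None] * betas).sum(axis=0) / w.sum()
```

With equal weights the weighted aggregation reduces to the naive average; the paper's contribution lies in choosing the weights so that noise variables are averaged down rather than accumulated across machines.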

Keywords

Variable selection / Distributed sparse regression / Memory constraints / Distributed penalized likelihood algorithm

Cite this article

Haofeng Wang, Xuejun Jiang, Min Zhou, Jiancheng Jiang. Variable Selection for Distributed Sparse Regression Under Memory Constraints. Communications in Mathematics and Statistics, 2023, 12(2): 307‒338 https://doi.org/10.1007/s40304-022-00291-w

Funding
National Natural Science Foundation of China (11871263); Natural Science Foundation of Guangdong Province (2017A030313012)
