Variable Selection for Distributed Sparse Regression Under Memory Constraints

Haofeng Wang, Xuejun Jiang, Min Zhou, Jiancheng Jiang

Communications in Mathematics and Statistics, 2023, Vol. 12, Issue 2: 307-338. DOI: 10.1007/s40304-022-00291-w
Abstract

This paper studies variable selection via the penalized likelihood method for distributed sparse regression with a large sample size n under a limited memory constraint, a problem of pressing importance in the big data era. A naive divide-and-conquer approach splits the whole data set into N parts, runs each part on one of N machines, aggregates the results from all machines by averaging, and finally reports the selected variables. However, this approach tends to select more noise variables, and its false discovery rate may not be well controlled. We improve it with a specially designed weighted average in the aggregation step. Although the alternating direction method of multipliers can handle massive data, our proposed method greatly reduces the computational burden and performs better in terms of mean squared error in most cases. Theoretically, we establish asymptotic properties of the resulting estimators for likelihood models with a diverging number of parameters. Under some regularity conditions, we establish oracle properties in the sense that our distributed estimator shares the same asymptotic efficiency as the estimator based on the full sample. Computationally, we propose a distributed penalized likelihood algorithm to refine the results in the context of general likelihoods. Finally, the proposed method is evaluated by simulations and a real example.
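To make the divide-and-conquer pipeline concrete, here is a minimal Python sketch of the scheme described above: split the data into N parts, fit a penalized regression on each part (scikit-learn's Lasso stands in for the general penalized likelihood), and aggregate the per-machine coefficients by a weighted average. The inverse-residual-variance weights and the screening threshold are illustrative assumptions for this sketch; they are not the paper's specially designed weighting.

import numpy as np
from sklearn.linear_model import Lasso

def distributed_lasso(X, y, n_machines=4, alpha=0.1, tol=1e-3):
    # Divide-and-conquer sparse regression with weighted aggregation.
    splits = np.array_split(np.arange(X.shape[0]), n_machines)
    coefs, weights = [], []
    for idx in splits:  # in a real deployment, each split runs on its own machine
        fit = Lasso(alpha=alpha).fit(X[idx], y[idx])
        resid = y[idx] - fit.predict(X[idx])
        coefs.append(fit.coef_)
        # illustrative weight: splits with smaller residual variance count more
        weights.append(1.0 / max(np.var(resid), 1e-12))
    beta = np.average(np.vstack(coefs), axis=0, weights=weights)
    beta[np.abs(beta) < tol] = 0.0  # screen out near-zero aggregated coefficients
    return beta

# Toy usage: 5 signal variables out of 50; print the selected indices.
rng = np.random.default_rng(0)
X = rng.standard_normal((2000, 50))
beta_true = np.zeros(50)
beta_true[:5] = [2.0, -1.5, 1.0, 0.8, -0.6]
y = X @ beta_true + rng.standard_normal(2000)
print(np.flatnonzero(distributed_lasso(X, y)))

A naive average corresponds to equal weights across machines; replacing them with data-driven weights, in the spirit of the paper's aggregation step, is what keeps noise variables from accumulating in the combined estimate.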

Keywords

Variable selection / Distributed sparse regression / Memory constraints / Distributed penalized likelihood algorithm

Cite this article

Haofeng Wang, Xuejun Jiang, Min Zhou, Jiancheng Jiang. Variable Selection for Distributed Sparse Regression Under Memory Constraints. Communications in Mathematics and Statistics, 2023, 12(2): 307‒338. https://doi.org/10.1007/s40304-022-00291-w

Funding
National Natural Science Foundation of China (11871263); Natural Science Foundation of Guangdong Province (2017A030313012)
