Variable Selection for Distributed Sparse Regression Under Memory Constraints
Haofeng Wang, Xuejun Jiang, Min Zhou, Jiancheng Jiang
This paper studies variable selection using the penalized likelihood method for distributed sparse regression with a large sample size n under a limited memory constraint, a problem that urgently needs to be solved in the big data era. A naive divide-and-conquer approach splits the whole data set into N parts, runs each part on one of N machines, aggregates the results from all machines via averaging, and finally obtains the selected variables. However, this approach tends to select more noise variables, and its false discovery rate may not be well controlled. We improve it with a specially designed weighted average in the aggregation step. Although the alternating direction method of multipliers has been used to handle massive data in the literature, our proposed method greatly reduces the computational burden and performs better in terms of mean squared error in most cases. Theoretically, we establish asymptotic properties of the resulting estimators for likelihood models with a diverging number of parameters. Under some regularity conditions, we establish oracle properties in the sense that our distributed estimator shares the same asymptotic efficiency as the estimator based on the full sample. Computationally, a distributed penalized likelihood algorithm is proposed to refine the results in the context of general likelihoods. Furthermore, the proposed method is evaluated by simulations and a real example.
Variable selection / Distributed sparse regression / Memory constraints / Distributed penalized likelihood algorithm
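The abstract describes the divide-and-conquer scheme only at a high level. The following minimal Python sketch illustrates that general idea: split the data into N parts, fit a lasso-penalized regression on each part, and compare a naive average of the local estimates with a weighted average in aggregation. The function name distributed_lasso, the use of scikit-learn's Lasso, and the inverse-residual-variance weights are illustrative assumptions only; they are not the paper's specially designed weighting scheme, which is not specified in the abstract.

import numpy as np
from sklearn.linear_model import Lasso

def distributed_lasso(X, y, n_machines=10, alpha=0.1):
    """Fit a lasso on each data split and return naive and weighted averages."""
    n, p = X.shape
    splits = np.array_split(np.arange(n), n_machines)
    local_coefs = np.empty((n_machines, p))
    local_weights = np.empty(n_machines)

    for k, idx in enumerate(splits):
        model = Lasso(alpha=alpha).fit(X[idx], y[idx])
        local_coefs[k] = model.coef_
        # Illustrative weight: inverse residual variance on the local split
        # (a hypothetical choice, not the paper's designed weights).
        resid = y[idx] - model.predict(X[idx])
        local_weights[k] = 1.0 / max(np.var(resid), 1e-12)

    naive_avg = local_coefs.mean(axis=0)
    weighted_avg = np.average(local_coefs, axis=0, weights=local_weights)
    return naive_avg, weighted_avg

# Example usage on synthetic sparse data.
rng = np.random.default_rng(0)
X = rng.standard_normal((5000, 50))
beta = np.zeros(50)
beta[:5] = 2.0
y = X @ beta + rng.standard_normal(5000)
naive, weighted = distributed_lasso(X, y)
print("nonzeros (naive):   ", np.sum(np.abs(naive) > 1e-3))
print("nonzeros (weighted):", np.sum(np.abs(weighted) > 1e-3))

In this sketch, each machine only ever touches its own split, which is what keeps the per-machine memory requirement bounded; the aggregation step exchanges only the p-dimensional coefficient vectors and scalar weights.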