Abstract
Extraordinary amounts of data are being produced in almost every branch of science. With very large data sets, proven statistical methods are often no longer applicable because of computational limitations. Subdata selection is considered an effective strategy for addressing this issue. In this study, we propose a novel framework for selecting subsets of data for spatial autoregression, termed information-based optimal subdata selection (IBOSS). We show that, while the information contained in subdata obtained by random sampling is limited by the size of the subset, the information contained in subdata obtained under the new framework increases as the size of the full data set increases. The performance of the proposed approach is compared with that of random sampling under various criteria via extensive simulation studies. Theoretical results and simulations demonstrate that the IBOSS approach performs better than random subsampling. The advantages of the new approach are also illustrated through an analysis of real data.
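The abstract does not spell out the selection procedure for spatial autoregression, so the following is only a minimal sketch of the simpler IBOSS rule for ordinary linear regression: for each covariate in turn, keep the not-yet-selected rows with the smallest and largest values. The helper names `iboss_select` and `log_det_information`, the simulated Gaussian covariates, and the use of the log-determinant of the information matrix as the comparison criterion (in the spirit of D-optimality) are all assumptions made for illustration, not the authors' implementation.

```python
import numpy as np


def iboss_select(X, k):
    """Hypothetical helper: indices of about k rows picked by the
    linear-regression IBOSS rule (extremes of each covariate)."""
    n, p = X.shape
    r = k // (2 * p)  # rows taken from each tail of each covariate
    if r < 1 or 2 * r * p > n:
        raise ValueError("k must satisfy 2*p <= k <= n")
    available = np.ones(n, dtype=bool)
    chosen = []
    for j in range(p):
        idx = np.flatnonzero(available)  # rows not selected so far
        order = idx[np.argsort(X[idx, j], kind="stable")]
        picked = np.concatenate([order[:r], order[-r:]])
        available[picked] = False  # never select the same row twice
        chosen.append(picked)
    return np.concatenate(chosen)  # 2 * r * p indices in total


def log_det_information(X):
    """Hypothetical helper: log-determinant of Z'Z, where Z is X with an
    intercept column (the information matrix up to the noise variance)."""
    Z = np.column_stack([np.ones(len(X)), X])
    _, logdet = np.linalg.slogdet(Z.T @ Z)
    return logdet


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    k, p = 300, 3
    for n in (5_000, 50_000, 500_000):
        X = rng.normal(size=(n, p))  # simulated covariates (assumption)
        iboss_rows = iboss_select(X, k)
        random_rows = rng.choice(n, size=k, replace=False)
        print(n,
              log_det_information(X[iboss_rows]),
              log_det_information(X[random_rows]))
```

Running the script prints the criterion for both subsets at several full-data sizes with the subset size held fixed, which is one way to look at the contrast the abstract draws between the two sampling schemes. The full sort is used here for brevity; a partition-based selection such as `np.argpartition` would avoid sorting each column completely.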
Keywords
Massive data / Information matrix / D-optimality criterion / Subdata / 62H10 / 62H12
Cite this article
Yunquan Song, Sijia Shen, Yaqi Liu.
Information-Based Optimal Subdata Selection for Large Sample Spatial Autoregression.
Communications in Mathematics and Statistics, 1-23. DOI: 10.1007/s40304-024-00435-0
Funding
National Key Research and Development Program of China (2021YFA1000102)
NSF project of Shandong Province of China (ZR2019MA016)
RIGHTS & PERMISSIONS
School of Mathematical Sciences, University of Science and Technology of China and Springer-Verlag GmbH Germany, part of Springer Nature