Information-Based Optimal Subdata Selection for Large Sample Spatial Autoregression

Yunquan Song, Sijia Shen, Yaqi Liu

Communications in Mathematics and Statistics: 1-23. DOI: 10.1007/s40304-024-00435-0

Research article

Abstract

Extraordinary amounts of data are being produced in almost every branch of science. For extremely large data sets, established statistical methods are no longer applicable because of computational limitations. Subdata selection is considered an effective strategy for addressing this issue. In this study, we propose a novel framework for selecting subdata for spatial autoregression, termed information-based optimal subdata selection (IBOSS). We show that, while the information contained in subdata obtained by random sampling is limited by the size of the subset, the information contained in subdata selected under the new framework increases with the size of the full data set. The performance of the proposed approach and that of random sampling are compared under various criteria through extensive simulation studies. Theoretical results and simulations demonstrate that the IBOSS approach performs better than random subsampling. The advantages of the new approach are also illustrated through an analysis of real data.
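The abstract does not spell out the selection rule used for spatial autoregression, so the following is only an illustrative sketch. It uses the deterministic rule known from the linear-regression IBOSS literature (Wang, Yang and Stufken): for each covariate in turn, keep the rows with the smallest and largest values. This is an assumption made for illustration and is not the authors' algorithm for the spatial model. The function names `iboss_select` and `log_det_information`, the simulated Gaussian covariates, and the sizes `n`, `p`, and `k` are all hypothetical. The sketch compares the log-determinant of the information matrix (the D-optimality criterion) for the selected subdata against a uniform random subsample of the same size.

```python
import numpy as np


def iboss_select(X, k):
    """Pick k row indices by taking extremes of each covariate in turn.

    Assumption: linear-regression-style IBOSS rule, with k divisible by 2*p.
    """
    n, p = X.shape
    r = k // (2 * p)                      # rows taken from each tail
    available = np.ones(n, dtype=bool)    # rows not selected so far
    chosen = []
    for j in range(p):
        idx = np.flatnonzero(available)
        col = X[idx, j]
        m = len(col)
        # One partial sort places the r smallest values first and the
        # r largest values last, without fully sorting the column.
        order = np.argpartition(col, [r - 1, m - r])
        picked = np.concatenate([idx[order[:r]], idx[order[-r:]]])
        chosen.append(picked)
        available[picked] = False
    return np.concatenate(chosen)


def log_det_information(X):
    """Log-determinant of Z'Z, where Z is X with an intercept column."""
    Z = np.column_stack([np.ones(len(X)), X])
    _, logdet = np.linalg.slogdet(Z.T @ Z)
    return logdet


# Hypothetical simulated data, for illustration only.
rng = np.random.default_rng(0)
n, p, k = 100_000, 3, 600
X = rng.standard_normal((n, p))

iboss_idx = iboss_select(X, k)
random_idx = rng.choice(n, size=k, replace=False)

print("log det, IBOSS subdata: ", log_det_information(X[iboss_idx]))
print("log det, random subdata:", log_det_information(X[random_idx]))
```

Because the selected rows sit in the tails of each covariate, the information matrix of the selected subdata keeps growing as the full data set grows, whereas a random subsample of fixed size `k` does not benefit from a larger `n`. This mirrors the contrast drawn in the abstract, under the simplifying assumptions stated above.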

Keywords

Massive data / Information matrix / D-optimality criterion / Subdata

Mathematics Subject Classification: 62H10 / 62H12

Cite this article

Yunquan Song, Sijia Shen, Yaqi Liu. Information-Based Optimal Subdata Selection for Large Sample Spatial Autoregression. Communications in Mathematics and Statistics: 1-23. DOI: 10.1007/s40304-024-00435-0



Funding

National Key Research and Development Program of China (2021YFA1000102)

NSF project of Shandong Province of China (ZR2019MA016)

RIGHTS & PERMISSIONS

School of Mathematical Sciences, University of Science and Technology of China and Springer-Verlag GmbH Germany, part of Springer Nature
