Significance Test for Multinomial Naive Bayes Classifier with Ultra-high-dimensional Binary Features

Baiguo An, Juan Zhang, Beibei Zhang, Wenliang Pan

Communications in Mathematics and Statistics: 1-26. DOI: 10.1007/s40304-025-00471-4

Abstract

We develop a significance testing method for the multinomial naive Bayes classifier with ultra-high-dimensional binary features. A novel test statistic with an asymptotically standard Gaussian null distribution is proposed. Under very mild assumptions, the power of the proposed test tends to 1 as the sample size tends to infinity. A sequential testing procedure is then developed to perform variable screening. We apply the proposed methods in extensive numerical studies, including simulated examples and two real text classification examples. The results show that our methods have good finite-sample performance.

Keywords

Binary feature / Multinomial naive Bayes / Significance test / Text classification / Ultra-high dimensional / 62H15 / 62H30

Cite this article

Baiguo An, Juan Zhang, Beibei Zhang, Wenliang Pan. Significance Test for Multinomial Naive Bayes Classifier with Ultra-high-dimensional Binary Features. Communications in Mathematics and Statistics: 1-26. DOI: 10.1007/s40304-025-00471-4

RIGHTS & PERMISSIONS

School of Mathematical Sciences, University of Science and Technology of China and Springer-Verlag GmbH Germany, part of Springer Nature
