Parallel naive Bayes algorithm for large-scale Chinese text classification based on spark

Peng Liu, Hui-han Zhao, Jia-yu Teng, Yan-yan Yang, Ya-feng Liu, Zong-wei Zhu

Journal of Central South University, 2019, Vol. 26, Issue 1: 1-12. DOI: 10.1007/s11771-019-3978-x

Abstract

The sharp increase in the amount of Chinese text data on the Internet has significantly prolonged the time needed to classify it. To address this problem, this paper proposes and implements a parallel naive Bayes algorithm (PNBA) for Chinese text classification based on Spark, a parallel in-memory computing platform for big data. The algorithm parallelizes the entire training and prediction process of the naive Bayes classifier, mainly by adopting the resilient distributed dataset (RDD) programming model. For comparison, a PNBA based on Hadoop is also implemented. Test results show that, in the same computing environment and on the same text sets, the Spark PNBA clearly outperforms the Hadoop PNBA on key indicators such as speedup ratio and scalability. Spark-based parallel algorithms can therefore better meet the requirements of large-scale Chinese text data mining.
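The abstract does not give the algorithm's details, so the following is only a minimal sketch of how naive Bayes training can be split into a per-partition counting step and a merge step, which is roughly the role that per-partition mapping and `reduceByKey`-style aggregation play on Spark RDDs. It is not the authors' implementation. It assumes a multinomial model with Laplace smoothing, texts that are already segmented into words, and plain Python lists standing in for RDD partitions; all function names and the toy data are hypothetical.

```python
import math
from collections import Counter
from functools import reduce


def map_partition(docs):
    """Map step: count classes and (class, word) pairs inside one partition."""
    class_counts, word_counts = Counter(), Counter()
    for label, tokens in docs:
        class_counts[label] += 1
        for tok in tokens:
            word_counts[(label, tok)] += 1
    return class_counts, word_counts


def merge(a, b):
    """Reduce step: add the counters of two partitions key by key."""
    return a[0] + b[0], a[1] + b[1]


def train(partitions, alpha=1.0):
    """Combine partition counts into log priors and smoothed log likelihoods."""
    class_counts, word_counts = reduce(merge, map(map_partition, partitions))
    vocab = {word for (_, word) in word_counts}
    n_docs = sum(class_counts.values())
    words_per_class = Counter()
    for (label, _), n in word_counts.items():
        words_per_class[label] += n
    log_prior = {c: math.log(n / n_docs) for c, n in class_counts.items()}

    def log_likelihood(label, word):
        # Laplace smoothing (alpha is an assumed hyperparameter).
        num = word_counts[(label, word)] + alpha
        den = words_per_class[label] + alpha * len(vocab)
        return math.log(num / den)

    return log_prior, log_likelihood, vocab


def predict(model, tokens):
    """Return the class with the highest posterior score; unseen words are skipped."""
    log_prior, log_likelihood, vocab = model

    def score(label):
        return log_prior[label] + sum(
            log_likelihood(label, t) for t in tokens if t in vocab
        )

    return max(log_prior, key=score)


# Toy, hypothetical data: two "partitions" of pre-segmented documents.
partitions = [
    [("sports", ["足球", "比赛"]), ("finance", ["股票", "市场"])],
    [("sports", ["比赛", "球队"]), ("finance", ["市场", "银行"])],
]
model = train(partitions)
sports_guess = predict(model, ["比赛", "足球"])
finance_guess = predict(model, ["银行", "股票"])
```

Because the counters are merged by simple addition, the partitions can be processed in any order and on different workers, which is what makes this training step straightforward to parallelize. Prediction is independent per document, so it can be distributed across partitions in the same way.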

Keywords

Chinese text classification / naive Bayes / Spark / Hadoop / resilient distributed dataset / parallelization

Cite this article

Peng Liu, Hui-han Zhao, Jia-yu Teng, Yan-yan Yang, Ya-feng Liu, Zong-wei Zhu. Parallel naive Bayes algorithm for large-scale Chinese text classification based on spark. Journal of Central South University, 2019, 26(1): 1-12. DOI: 10.1007/s11771-019-3978-x
