Optimal Number of Topics in Topic Modeling Using Deep Auto Encoder Graph Regularized Non-Negative Matrix Factorization Algorithm

Pooja Kherwa, Jyoti Arora

Journal of Systems Science and Systems Engineering ›› 2024, Vol. 34 ›› Issue (3) : 257-283.

DOI: 10.1007/s11518-024-5639-3

Abstract

Topic modeling is a well-explored, foundational problem in text mining. Traditional topic models, based on word co-occurrences, aim to expose the latent semantic structure embedded in a document corpus. However, the brevity of short texts introduces data sparsity, which hinders conventional topic models and yields suboptimal results on such text. Short texts typically cover a restricted number of topics, so relevant background knowledge is needed to fully understand their semantic content. Motivated by these observations, this research introduces a novel Deep Auto Encoder Graph Regularized Non-negative Matrix Factorization (DAGR-NMF) algorithm to uncover significant and meaningful topics in short documents. The proposed work has three main phases: preprocessing, feature extraction, and topic modeling. First, the data are preprocessed using natural language processing tasks such as stop word removal, stemming, and lemmatization. Then, feature extraction is performed using a hybrid Absolute Deviation Factors-Class Term Frequency (ADF-CTF) scheme to capture the most relevant information from the text. Finally, the topic modeling task is executed using the proposed DAGR-NMF approach. Experimental findings show that DAGR-NMF outperforms all compared techniques, achieving NMI values of 0.852, 0.857, 0.793, and 0.831 on the Associated Press, Political Blog, 20NewsGroups, and News Category datasets, respectively.
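
The NMI (normalized mutual information) scores reported above compare the topic assigned to each document against its ground-truth category. The following is a minimal sketch of that metric, not the authors' evaluation code. The function names are hypothetical, and the arithmetic-mean normalization, 2·I(A;B) / (H(A) + H(B)), is an assumption, since the abstract does not state which normalization was used.

```python
from collections import Counter
from math import log


def entropy(labels):
    """Shannon entropy (natural log) of a label sequence."""
    n = len(labels)
    return -sum((c / n) * log(c / n) for c in Counter(labels).values())


def mutual_information(a, b):
    """Mutual information between two labelings of the same documents."""
    n = len(a)
    joint = Counter(zip(a, b))          # co-occurrence counts of (label_a, label_b)
    count_a, count_b = Counter(a), Counter(b)
    return sum(
        (c / n) * log(c * n / (count_a[x] * count_b[y]))
        for (x, y), c in joint.items()
    )


def nmi(true_labels, predicted_topics):
    """NMI with arithmetic-mean normalization (an assumed choice)."""
    if len(true_labels) != len(predicted_topics):
        raise ValueError("both labelings must cover the same documents")
    h_true = entropy(true_labels)
    h_pred = entropy(predicted_topics)
    if h_true == 0.0 and h_pred == 0.0:
        return 1.0                      # both labelings are a single cluster
    return 2.0 * mutual_information(true_labels, predicted_topics) / (h_true + h_pred)


if __name__ == "__main__":
    # Hypothetical toy labels: four documents, two categories.
    truth = ["politics", "politics", "sports", "sports"]
    topics = [1, 1, 0, 0]               # same partition, different topic ids
    print(nmi(truth, topics))
```

NMI is invariant to how topics are numbered, so a topic model is not penalized for assigning arbitrary topic ids. A score near 1 means the discovered topics reproduce the reference categories, and a score near 0 means they are statistically independent of them.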

Keywords

Topic modeling / natural language processing / non-negative matrix factorization / purity and topic coherence

Cite this article

Pooja Kherwa, Jyoti Arora. Optimal Number of Topics in Topic Modeling Using Deep Auto Encoder Graph Regularized Non-Negative Matrix Factorization Algorithm. Journal of Systems Science and Systems Engineering, 2024, 34(3): 257-283. DOI: 10.1007/s11518-024-5639-3

RIGHTS & PERMISSIONS

Systems Engineering Society of China and Springer-Verlag GmbH Germany
