PSLDA: a novel supervised pseudo document-based topic model for short texts
Mingtao SUN, Xiaowei ZHAO, Jingjing LIN, Jian JING, Deqing WANG, Guozhu JIA
Online social media applications such as Twitter and Weibo generate a huge volume of short texts. However, efficiently mining semantic topics from short texts remains challenging because of the sparseness of word occurrence and the diversity of topics. To address these problems, we propose a novel supervised pseudo-document-based maximum entropy discrimination latent Dirichlet allocation model (PSLDA for short). Specifically, we first assume that short texts are generated from normal-sized latent pseudo-documents, and that the topic distributions are sampled from these pseudo-documents. In this way, the model reduces the sparseness of word occurrence and the diversity of topics, because it implicitly aggregates short texts into longer, higher-level pseudo-documents. To make full use of the label information in the training data, we introduce labels into the model and further propose a supervised topic model that learns a reasonable distribution of topics. Extensive experiments demonstrate that the proposed method achieves better performance than several state-of-the-art methods.
supervised topic model / short text / pseudo-document
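The generative assumption stated in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it covers only the pseudo-document part (each short text is tied to one latent pseudo-document and draws its topics from that pseudo-document's topic distribution), and it omits the supervised maximum entropy discrimination component entirely. The function name `generate_short_texts`, the symmetric Dirichlet priors, and every default value are assumptions chosen for illustration.

```python
import numpy as np


def generate_short_texts(n_texts=200, n_pseudo=10, n_topics=5, vocab_size=50,
                         text_len=8, alpha=0.5, beta=0.5, seed=0):
    """Sample short texts from latent pseudo-documents (illustrative only).

    All names and hyperparameter values are hypothetical; they are not
    taken from the paper.
    """
    rng = np.random.default_rng(seed)

    # One word distribution per topic, shape (n_topics, vocab_size).
    phi = rng.dirichlet(np.full(vocab_size, beta), size=n_topics)
    # One topic distribution per pseudo-document, shape (n_pseudo, n_topics).
    theta = rng.dirichlet(np.full(n_topics, alpha), size=n_pseudo)
    # Distribution over pseudo-documents (uniform Dirichlet prior assumed).
    psi = rng.dirichlet(np.ones(n_pseudo))

    # Each short text is assigned to exactly one latent pseudo-document.
    assignments = rng.choice(n_pseudo, size=n_texts, p=psi)

    texts = []
    for pseudo_id in assignments:
        # Topics come from the pseudo-document, not from the short text.
        topics = rng.choice(n_topics, size=text_len, p=theta[pseudo_id])
        # Each word is drawn from the word distribution of its topic.
        words = [int(rng.choice(vocab_size, p=phi[k])) for k in topics]
        texts.append(words)
    return texts, assignments, theta, phi


if __name__ == "__main__":
    texts, assignments, theta, phi = generate_short_texts()
    # Pooling the texts of one pseudo-document yields many more tokens
    # than any single short text provides.
    pooled = sum(len(t) for t, a in zip(texts, assignments) if a == assignments[0])
    print("tokens in one short text:", len(texts[0]))
    print("tokens pooled in its pseudo-document:", pooled)
```

Because topic proportions are attached to pseudo-documents rather than to individual short texts, the evidence available for estimating each topic distribution is the pooled token count, which is the sense in which the aggregation reduces sparsity.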
Mingtao Sun is a PhD candidate in the School of Economics and Management, Beihang University, China. His research interests include big data processing and education administration.
Xiaowei Zhao is currently pursuing the PhD degree in computer science at Beihang University, China. Her main research interests include transfer learning and sentiment analysis.
Jingjing Lin is currently a senior student at the School of Instrumentation and Optoelectronic Engineering, Beihang University, China. Her research interests include text classification, natural language inference, and sentiment analysis.
Jian Jing received the MS degree in the Engineering of Computer Technology from Beihang University, China in 2021. His research interests include knowledge reasoning, algorithms, and big data processing.
Deqing Wang received the PhD degree in computer science from Beihang University, China in 2013. He is currently an Associate Professor with the School of Computer Science and the Deputy Chief Engineer with the National Engineering Research Center for Science Technology Resources Sharing and Service, Beihang University, China. His research focuses on text categorization and data mining for software engineering and machine learning.
Guozhu Jia received the PhD degree from Aalborg University, Denmark. He is currently a Professor in the School of Economics and Management, Beihang University, China, and a member of the Expert Committee of the China Manufacturing Servitization Alliance. He is also a director of the China Innovation Method Society.