Topic discovery and evolution in scientific literature based on content and citations
Hou-kui ZHOU, Hui-min YU, Roland HU
Researchers have become increasingly interested in how important research topics evolve over time within the scientific literature. In a dataset of scientific articles, each document can be viewed as comprising both its own words and its citations of other documents. In this paper, we propose a citation-content-latent Dirichlet allocation (LDA) topic discovery method that accounts for both the citation relations among documents and the content of each document via a probabilistic generative model. The citation-content-LDA topic model is a two-level topic model that uses citation information for ‘father’ topics and text information for sub-topics. The model parameters are estimated by a collapsed Gibbs sampling algorithm. We also propose a topic evolution algorithm that runs in two steps: topic segmentation and topic dependency relation calculation. We have tested the proposed citation-content-LDA model and topic evolution algorithm on two online datasets, IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI) and IEEE Computer Society (CS), to demonstrate that our algorithm effectively discovers important topics and reflects the evolution of important research themes. According to our evaluation metrics, citation-content-LDA outperforms both content-LDA and citation-LDA.
Topic extraction / Topic evolution / Evaluation method
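The abstract states that the model parameters are estimated by collapsed Gibbs sampling. As a rough illustration of that estimation technique only, the sketch below applies collapsed Gibbs sampling to plain (content-only) LDA. It is not the citation-content-LDA model itself, whose two-level structure is not specified in the abstract. The function name `lda_collapsed_gibbs`, the hyperparameter defaults for `alpha` and `beta`, and the toy corpus are all assumptions made for this example.

```python
import random


def lda_collapsed_gibbs(docs, vocab_size, num_topics,
                        alpha=0.1, beta=0.01, iters=200, seed=0):
    """Collapsed Gibbs sampling for standard LDA (illustrative sketch).

    docs: list of documents, each a list of integer word ids in [0, vocab_size).
    Returns (theta, phi): per-document topic and per-topic word distributions.
    """
    rng = random.Random(seed)
    doc_topic = [[0] * num_topics for _ in docs]                # n_{d,k}
    topic_word = [[0] * vocab_size for _ in range(num_topics)]  # n_{k,w}
    topic_total = [0] * num_topics                              # n_k
    assignments = []

    # Random initial topic assignment for every token.
    for d, doc in enumerate(docs):
        z_doc = []
        for w in doc:
            k = rng.randrange(num_topics)
            z_doc.append(k)
            doc_topic[d][k] += 1
            topic_word[k][w] += 1
            topic_total[k] += 1
        assignments.append(z_doc)

    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the current token from the counts.
                k = assignments[d][i]
                doc_topic[d][k] -= 1
                topic_word[k][w] -= 1
                topic_total[k] -= 1
                # Full conditional p(z = t | all other assignments), unnormalised.
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][w] + beta)
                    / (topic_total[t] + vocab_size * beta)
                    for t in range(num_topics)
                ]
                # Resample the topic and restore the counts.
                k = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][w] += 1
                topic_total[k] += 1

    # Point estimates from the final sample.
    theta = [
        [(doc_topic[d][k] + alpha) / (len(doc) + num_topics * alpha)
         for k in range(num_topics)]
        for d, doc in enumerate(docs)
    ]
    phi = [
        [(topic_word[k][w] + beta) / (topic_total[k] + vocab_size * beta)
         for w in range(vocab_size)]
        for k in range(num_topics)
    ]
    return theta, phi


# Toy corpus (hypothetical): word ids 0-1 and 2-3 tend to co-occur.
toy_docs = [[0, 1, 0, 1], [2, 3, 2, 3], [0, 1, 1, 0], [3, 2, 2, 3]]
theta, phi = lda_collapsed_gibbs(toy_docs, vocab_size=4, num_topics=2)
```

Each row of `theta` and `phi` is a probability distribution. A model that also uses citations would need additional count tables and sampling steps for the citation level, which are omitted here.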