Online belief propagation algorithm for probabilistic latent semantic analysis

Yun YE; Shengrong GONG; Chunping LIU; Jia ZENG; Ning JIA; Yi ZHANG

doi:10.1007/s11704-013-2360-7

PDF(866 KB)

Front. Comput. Sci. ›› 2013, Vol. 7 ›› Issue (4) : 526-535. DOI: 10.1007/s11704-013-2360-7

RESEARCH ARTICLE

Online belief propagation algorithm for probabilistic latent semantic analysis

Author information +

History +

Abstract

Probabilistic latent semantic analysis (PLSA) is a topic model for text documents, which has been widely used in text mining, computer vision, computational biology and so on. For batch PLSA inference algorithms, the required memory size grows linearly with the data size, and handling massive data streams is very diffcult. To process big data streams, we propose an online belief propagation (OBP) algorithm based on the improved factor graph representation for PLSA. The factor graph of PLSA facilitates the classic belief propagation (BP) algorithm. Furthermore, OBP splits the data stream into a set of small segments, and uses the estimated parameters of previous segments to calculate the gradient descent of the current segment. Because OBP removes each segment from memory after processing, it is memory effcient for big data streams. We examine the performance of OBP on four document data sets, and demonstrate that OBP is competitive in both speed and accuracy for online expectation maximization (OEM) in PLSA, and can also give a more accurate topic evolution. Experiments on massive data streams from Baidu further confirm the effectiveness of the OBP algorithm.

Keywords

probabilistic latent semantic analysis / topic models / expectation maximization / belief propagation

Cite this article

EndNote

Ris (Procite)

Bibtex

Download citation ▾

Yun YE, Shengrong GONG, Chunping LIU, Jia ZENG, Ning JIA, Yi ZHANG. Online belief propagation algorithm for probabilistic latent semantic analysis. Front Comput Sci, 2013, 7(4): 526‒535 https://doi.org/10.1007/s11704-013-2360-7

This is a preview of subscription content, contact us for subscripton.

References

Publishing order | Descend order by publishing year | Descend order by cited within

[1]	Salton G, Wong A, Yang C S. A vector space model for automatic indexing. Communications of the ACM, 1975, 18(11): 613-620 CrossRef Google scholar

[2]	Thomas K, Landauer P W F, Laham A F. An introduction to latent semantic analysis. Communications of the ACM, 1998, 25: 259-284

[3]	Hoffman T. Probabilistic latent semantic analysis: uncertainty in arti- ficial intelligence. 1999

[4]	Blei DM, Ng A Y, Jordan MI. Latent dirichlet allocation. The Journal of Machine Learning Research, 2003, 3: 993-1022

[5]	Canini K R, Shi L, Griffiths T L. Online inference of topics with latent dirichlet allocation. In: Proceedings of the 2009 International Conference on Artificial Intelligence and Statistics. 2009, 65-72

[6]	Zeng J, Cheung W K, Liu J. Learning topic models by belief propagation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2012, 1

[7]	Zhuang L, She L, Jiang Y, Tang K, Yu N. Image classification via semi-supervised PLSA. In: Proceedings of the 5th International Conference on Image and Graphics. 2009, 205-208

[8]	Xu J, Ye G, Wang Y, Wang W, Yang J. Online learning for plsa-based visual recognition. Computer Vision-ACCV 2010, 2011, 95-108

[9]	AlSumait L, Barbará D, Domeniconi C. On-line LDA: adaptive topic models for mining text streams with applications to topic detection and tracking. In: Proceedings of the 8th IEEE International Conference on Data Mining. 2008, 3-12

[10]	Yao L, Mimno D, McCallum A. Efficient methods for topic model inference on streaming document collections. In: Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 2009, 937-946 CrossRef Google scholar

[11]	Hoffman M D, Blei D M, Bach F. Online learning for latent Dirichlet allocation. Advances in Neural Information Processing Systems, 2010, 23: 856-864

[12]	Wang C, Paisley J, Blei D M. Online variational inference for the hierarchical dirichlet process. In: Proceedings of the 14th Intenational Conference on Artificial Intelligence and Statistics. 2011, 752-760

[13]	Banerjee A, Basu S. Topic models over text streams: a study of batch and online unsupervised learning. In: Proceedings of the 2007 SIAM International Conference on Data Mining. 2007, 431-436

[14]	Nair V, Clark J J. An unsupervised, online learning framework for moving object detection. In: Proceedings of the 2004 IEEE Computer Society Conference on Computer Vision and Pattern Recognition. 2004, II-317-II-324 CrossRef Google scholar

[15]	Pham M T, Cham T J. Online learning asymmetric boosted classifiers for object detection. In: Proceedings of the 2007 IEEE Conference on Computer Vision and Pattern Recognition. 2007, 1-8 CrossRef Google scholar

[16]	Shalev-Shwartz S, Singer Y, Ng A Y. Online and batch learning of pseudo-metrics. In: Proceedings of the 21st International Conference on Machine Learning. 2004

[17]	Mairal J, Bach F, Ponce J, Sapiro G. Online learning for matrix factorization and sparse coding. The Journal of Machine Learning Research, 2010, 11: 19-60

[18]	Vijayakumar S, D’souza A, Schaal S. Incremental online learning in high dimensions. Neural Computation, 2005, 17(12): 2602-2634 CrossRef Google scholar

[19]	Kivinen J, Smola A J, Williamson R C. Online learning with kernels. IEEE Transactions on Signal Processing, 2004, 52(8): 2165-2176 CrossRef Google scholar

[20]	Xu J, Ye G, Wang Y, Herman G, Zhang B, Yang J. Incremental EMfor probabilistic latent semantic analysis on human action recognition. In: Proceedings of the 6th IEEE International Conference on Advanced Video and Signal Based Surveillance. 2009, 55-60

[21]	Singh M, Khan F U. Effect of incremental EM on document summarization using probabilistic latent semantic analysis. Lecture Notes in Engineering and Computer Science, 2012, 2198

[22]	Bottou L. Online learning and stochastic approximations. On-line Learning in Neural Networks, 1998, 17: 9-42

[23]	Zhu S, Zeng J, Mamitsuka H. Enhancing medline document clustering by incorporating mesh semantic similarity. Bioinformatics, 2009, 25(15): 1944-1951 CrossRef Google scholar

[24]	Globerson A, Chechik G, Pereira F, Tishby N. Euclidean embedding of co-occurrence data. Journal of Machine Learning Research, 2007, 8: 2047-2076

[25]	Eisenstein J, Xing E. The CMU 2008 political blog corpus. Machine Learning Department, School of Computer Science, Carnegie Mellon University, 2010

[26]	Porteous I, Newman D, Ihler A, Asuncion A, Smyth P, Welling M. Fast collapsed gibbs sampling for latent dirichlet allocation. In: Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 2008, 569-577 CrossRef Google scholar