Enriching short text representation in microblog for clustering

Jiliang TANG; Xufei WANG; Huiji GAO; Xia HU; Huan LIU

doi:10.1007/s11704-011-1167-7

Front. Comput. Sci., 2012, Vol. 6, Issue 1: 88-101. DOI: 10.1007/s11704-011-1167-7

RESEARCH ARTICLE

Abstract

Social media websites allow users to exchange short texts, such as tweets in microblogs and user statuses in friendship networks. The limited length of these texts, together with pervasive abbreviations and coined acronyms and words, exacerbates the problems of synonymy and polysemy and poses new challenges for data mining applications such as text clustering and classification. To address these issues, we analyze some potential causes and devise an efficient approach that enriches the data representation by employing machine translation to add features from different languages. We then propose a novel framework that performs multi-language knowledge integration and feature reduction simultaneously through matrix factorization techniques. The proposed approach is evaluated extensively for effectiveness on two social media datasets, from Facebook and Twitter. Having observed a significant performance improvement, we further investigate the factors that may contribute to it.
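The general idea of enriching a representation with features from another language and then reducing it by matrix factorization can be illustrated with a small sketch. This is not the framework proposed in the article, whose details are not given in the abstract. It assumes that each short text already has a term-count vector in its original language and a second one obtained from a machine translation, concatenates the two, and applies plain non-negative matrix factorization with multiplicative updates. The toy counts and the names `english`, `translated`, `enriched`, and `nmf` are hypothetical and chosen only for illustration.

```python
import random


def matmul(a, b):
    """Multiply two dense matrices given as lists of rows."""
    b_cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in b_cols] for row in a]


def transpose(a):
    """Return the transpose of a matrix given as a list of rows."""
    return [list(col) for col in zip(*a)]


def frob_error(x, u, v):
    """Squared Frobenius norm of x - u v."""
    uv = matmul(u, v)
    return sum((xi - ri) ** 2 for xr, rr in zip(x, uv) for xi, ri in zip(xr, rr))


def nmf(x, k, iters=200, seed=0, eps=1e-9):
    """Factor a non-negative matrix x (n x m) into u (n x k) and v (k x m)
    using multiplicative updates for the squared Frobenius error."""
    rng = random.Random(seed)
    n, m = len(x), len(x[0])
    u = [[rng.random() + 0.1 for _ in range(k)] for _ in range(n)]
    v = [[rng.random() + 0.1 for _ in range(m)] for _ in range(k)]
    for _ in range(iters):
        # Update the term factors v while holding u fixed.
        ut = transpose(u)
        num, den = matmul(ut, x), matmul(matmul(ut, u), v)
        v = [[v[i][j] * num[i][j] / (den[i][j] + eps) for j in range(m)]
             for i in range(k)]
        # Update the document factors u while holding v fixed.
        vt = transpose(v)
        num, den = matmul(x, vt), matmul(u, matmul(v, vt))
        u = [[u[i][j] * num[i][j] / (den[i][j] + eps) for j in range(k)]
             for i in range(n)]
    return u, v


# Hypothetical toy data: 4 short texts, 4 original-language terms,
# and 2 terms from an assumed machine translation of the same texts.
english = [[2, 1, 0, 0], [1, 2, 0, 0], [0, 0, 2, 1], [0, 0, 1, 2]]
translated = [[2, 0], [1, 0], [0, 1], [0, 2]]

# Enrich: concatenate both feature sets for each text.
enriched = [e + t for e, t in zip(english, translated)]

# Reduce: each text is now described by k = 2 latent factors.
u, v = nmf(enriched, k=2)

# A simple cluster assignment: the index of the largest latent factor.
labels = [row.index(max(row)) for row in u]
print(labels, frob_error(enriched, u, v))
```

The rows of `u` are the reduced representations and could be passed to any clustering algorithm. Concatenation is only the simplest way to combine languages; a framework that integrates them jointly, as the abstract describes, may weight or couple the views differently.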

Keywords

short texts / text representation / multi-language knowledge / matrix factorization / social media

Cite this article

Jiliang TANG, Xufei WANG, Huiji GAO, Xia HU, Huan LIU. Enriching short text representation in microblog for clustering. Front. Comput. Sci., 2012, 6(1): 88-101. DOI: 10.1007/s11704-011-1167-7

RIGHTS & PERMISSIONS

Higher Education Press and Springer-Verlag Berlin Heidelberg
