Clustering-based topical Web crawling using CFu-tree guided by link-context

Lu LIU; Tao PENG

doi:10.1007/s11704-014-3050-9

Front. Comput. Sci., 2014, Vol. 8, Issue 4: 581-595. DOI: 10.1007/s11704-014-3050-9

RESEARCH ARTICLE

Abstract

Topical Web crawling is an established technique for domain-specific information retrieval. However, almost all conventional topical Web crawlers are built around classifiers, which require large amounts of labeled training data that are very difficult to label manually. This paper presents a novel approach, clustering-based topical Web crawling, which retrieves information on a specific domain based on link-context and does not require any labeled training data. To collect domain-specific content units, a novel bottom-up hierarchical clustering method is used, in which a new data structure, a linked list combined with a CFu-tree, stores the cluster label, feature vector, and content unit. Four metrics are presented for the clustering process. First, comparison variation (CV) is defined to judge whether the closest pair of clusters can be merged. Second, cluster impurity (CIP) evaluates the cluster error. Third and fourth, the precision and recall of clustering evaluate the accuracy and completeness of the whole clustering process. A link-context extraction technique is used to expand the feature vector of the anchor text, which greatly improves clustering accuracy. Experimental results show that the proposed method outperforms conventional focused Web crawlers in both Harvest rate and Target recall.
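
As a rough illustration of the ideas summarized in the abstract, the following Python sketch clusters short link-context strings bottom-up and computes the two crawl-level measures named above. It is not the authors' implementation: the abstract does not define CV, CIP, or the CFu-tree, so the merge test here is a plain cosine-similarity threshold (`merge_threshold`, a hypothetical stand-in for the CV test), and clusters are held in an ordinary Python list. Harvest rate and Target recall follow their usual definitions in the focused-crawling literature, which is assumed to match the paper's usage.

```python
import math
from collections import Counter


def vectorize(text):
    """Bag-of-words term-frequency vector for an anchor text plus its link-context."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def bottom_up_cluster(texts, merge_threshold=0.3):
    """Agglomerative clustering: repeatedly merge the closest pair of clusters.

    merge_threshold is a hypothetical stand-in for the paper's comparison
    variation (CV) test, whose definition is not given in the abstract.
    """
    # Each cluster keeps a summed feature vector and the indices of its members.
    clusters = [(vectorize(t), [i]) for i, t in enumerate(texts)]
    while len(clusters) > 1:
        best_sim, best_pair = -1.0, None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                sim = cosine(clusters[i][0], clusters[j][0])
                if sim > best_sim:
                    best_sim, best_pair = sim, (i, j)
        if best_sim < merge_threshold:
            break  # the closest pair is too dissimilar to merge
        i, j = best_pair
        merged = (clusters[i][0] + clusters[j][0], clusters[i][1] + clusters[j][1])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return [sorted(members) for _, members in clusters]


def harvest_rate(crawled_relevance):
    """Fraction of crawled pages judged relevant (input: list of booleans)."""
    return sum(crawled_relevance) / len(crawled_relevance) if crawled_relevance else 0.0


def target_recall(crawled_urls, target_urls):
    """Fraction of a known target set that the crawl has reached."""
    targets = set(target_urls)
    return len(targets & set(crawled_urls)) / len(targets) if targets else 0.0


if __name__ == "__main__":
    contexts = [
        "machine learning course",
        "machine learning tutorial",
        "cheap flights deals",
        "cheap hotel deals",
    ]
    print(bottom_up_cluster(contexts))
    print(harvest_rate([True, False, True, True]))
    print(target_recall(["a", "b", "c"], ["a", "d"]))
```

The pairwise search makes each merge step quadratic in the number of clusters, which is acceptable for a sketch but is presumably the kind of cost a tree-based index such as the CFu-tree is meant to reduce.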

Keywords

topical Web crawling / comparison variation (CV) / cluster impurity (CIP) / CFu-tree / link-context / clustering

Cite this article

Lu LIU, Tao PENG. Clustering-based topical Web crawling using CFu-tree guided by link-context. Front. Comput. Sci., 2014, 8(4): 581-595. DOI: 10.1007/s11704-014-3050-9

RIGHTS & PERMISSIONS

Higher Education Press and Springer-Verlag Berlin Heidelberg
