Clustering-based topical Web crawling using CFu-tree guided by link-context
Lu LIU, Tao PENG
Clustering-based topical Web crawling using CFu-tree guided by link-context
Topical Web crawling is an established technique for domain-specific information retrieval. However, almost all the conventional topical Web crawlers focus on building crawlers using different classifiers, which needs a lot of labeled training data that is very difficult to label manually. This paper presents a novel approach called clustering-based topical Web crawling which is utilized to retrieve information on a specific domain based on link-context and does not require any labeled training data. In order to collect domain-specific content units, a novel hierarchical clustering method called bottom-up approach is used to illustrate the process of clustering where a new data structure, a linked list in combination with CFu-tree, is implemented to store cluster label, feature vector and content unit. During clustering, four metrics are presented. First, comparison variation (CV) is defined to judge whether the closest pair of clusters can be merged. Second, cluster impurity (CIP) evaluates the cluster error. Then, the precision and recall of clustering are also presented to evaluate the accuracy and comprehensive degree of the whole clustering process. Link-context extraction technique is used to expand the feature vector of anchor text which improves the clustering accuracy greatly. Experimental results show that the performance of our proposed method overcomes conventional focused Web crawlers both in Harvest rate and Target recall.
topical Web crawling / comparison variation (CV) / cluster impurity (CIP) / CFu-tree / link-context / clustering
[1] |
Sun Y, Han J. Mining heterogeneous information networks: principles and methodologies. Morgan & Claypool Publishers, 2012
|
[2] |
McCallum A, Nigam K. A comparison of event models for Naïve Bayes text classification. In: Proceedings of the AAAI-98 Workshop on Learning for Text Categorization, 1998, 752: 41-48
|
[3] |
Liu B, Hsu W, Ma Y. Integrating classification and association rule mining. In: Proceedings of Knowledge Discovery and Data Mining (KDD’98), 1998, 80-86
|
[4] |
Boser B, Guyon I, Vapnik V. A training algorithm for optimal margin classifiers. In: Proceedings of the 5th Annual Workshop on Computational Learning Theory, 1992, 5: 144-152
|
[5] |
Chou C, Lee C, Chen Y. GA-based keyword selection for the design of an intelligent Web document search system. Computer Journal, 2009, 52(8): 890-901
CrossRef
Google scholar
|
[6] |
Steinbach M, Karypis G, Kumar V. A comparison of document clustering techniques. In: Proceedings of KDD Workshop on Text Mining, 2000, 400(1): 525-526
|
[7] |
Jain A, Murty M, Flyn P. Data clustering: a review. ACM Computing Surveys, 1999, 31(3): 264-323
CrossRef
Google scholar
|
[8] |
Fung B, Wang K, Ester M. Hierarchical document clustering using frequent itemsets. In: Proceedings of SIAM International Conference on Data Mining (SDM’03), 2003, 59-70
|
[9] |
Fu T, Abbasi A, Chen H. A focused crawler for darkWeb forums. Journal of the American Society for Information Science and Technology, 2010, 61(6): 1213-1231
|
[10] |
Li J, Furuse K, Yamaguchi K. Focused crawling by exploiting anchor text using decision tree. In: Proceedings of Special Interest Tracks and Posters of the 14th International Conference on World Wide Web. New York: ACM, 2005, 1190-1191
|
[11] |
Liu H, Milios E. Probabilistic models for focused Web crawling. Computational Intelligence, 2012, 28(3): 289-328
CrossRef
Google scholar
|
[12] |
Hao H, Mu C, Yin X, Li S, Wang Z. An improved topic relevance algorithm for focused crawling. In: Proceedings of IEEE International Conference on Systems Man and Cyvernetics Conference, 2011, 850-855
|
[13] |
Pant G, Srinivasan P. Link contexts in classifier-guided topical crawlers. IEEE Transactions on Knowledge and Data Engineering, 2006, 18(1): 107-122
CrossRef
Google scholar
|
[14] |
Zhang H, Lu J. SCTWC: an online semi-supervised clustering approach to topical Web crawlers. Applied Soft Computing, 2010, 10(2): 490-495
CrossRef
Google scholar
|
[15] |
Liu Y, Agah A. Topical crawling on the Web through local sitesearchers. Journal of Web Engineering, 2013, 12(3-4): 203-214
|
[16] |
Rangrej A, Kulkarni S, Tendulkar A. Comparative study of clustering techniques for short text documents. In: Proceedings of the 20th International Conference Companion on World Wide Web. New York: ACM, 2011, 111-112
|
[17] |
Wang X, Tang J, Liu H. Document clustering via matrix representation. In: Proceedings 2011 IEEE 11th International Conference on Data Mining, 2011, 804-813
CrossRef
Google scholar
|
[18] |
Cota R, Ferreira A, Nascimento C, Goncalves M, Laender A. An unsupervised heuristic-based hierarchical method for name disambiguation in bibliographic citations. Journal of the American Society for Information Science and Technology, 2010, 61(9): 1853-1870
CrossRef
Google scholar
|
[19] |
Spanakis G, Siolas G, Stafylopatis A. Exploiting wikipedia knowledge for conceptual hierarchical clustering of documents. Computer Journal, 2012, 55(3): 299-312
CrossRef
Google scholar
|
[20] |
Bouras C, Tsogkas V. A clustering technique for news articles using WordNet. Knowledge-Based Systems, 2012, 36: 115-128
CrossRef
Google scholar
|
[21] |
Li J, Zhao Y, Liu B. Exploiting semantic resources for large scale text categorization. Journal of Intelligent Information Systems, 2012, 39(3): 763-788
CrossRef
Google scholar
|
[22] |
Trivedi A, Rai, P, Daume H, Duvall, S. Leveraging social bookmarks from partially tagged corpus for improved Web page clustering. ACM Transactions on Intelligent Systems and Technology, 2012, 3(4), Article 67
|
[23] |
Wu M, Hawking D, Turpin A, Scholer F. Using anchor text for homepage and topic distillation search tasks. Journal of the American Society for Information Science and Technology, 2012, 63(6): 1234-1255
CrossRef
Google scholar
|
[24] |
Hersovici M, Jacovi M, Maarek Y, Pellegb D, Shtalhaima M, Ura S. The shark-search algorithm. an application: tailored Web site mapping. Computer Networks and ISDN Systems, 1998, 30(1): 317-326
CrossRef
Google scholar
|
[25] |
Chakrabarti S, Dom B, Gibson D, Kleinberg J, Raghavan P, Rajagopalan S. Automatic resource list compilation by analyzing hyperlink structure and associated text. Computer Networks and ISDN Systems, 1998, 30(1): 65-74
CrossRef
Google scholar
|
[26] |
Pant G. Deriving link-context from HTML tag tree. In: Proceedings of 8th ACM SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery, 2003, 49-55
CrossRef
Google scholar
|
[27] |
Qi G, Aggarwal C, Tian Q, Ji H, Huang T. Exploring context and content links in social media: a latent space method. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2012, 34(5): 850-862
CrossRef
Google scholar
|
[28] |
Attardi G, Gullı’ A, Sebastiani F. Automatic Web page categorization by link and context analysis. In: Proceedings of the 1st European Symposium on Telematics, Hypermedia, and Artificial Intelligence, 1999, 99: 105-109
|
[29] |
Brin S, Page L. The anatomy of a large-scale hypertextual Web search engine. Computer Networks and ISDN Systems, 1998, 30(1-7): 107-117
|
[30] |
Pant G, Tsioutsiouliklis K, Johnson J, Giles C. Panorama: extending digital libraries with topical crawlers. In: Proceedings of 4th ACM/IEEE-CS Joint Conference Digital Libraries, 2004, 142-150
|
[31] |
Peng T, Zhang C, Zuo W. Tunneling enhanced by Web page content bloc partition for focused crawling. Concurrency and Computation: Practice and Experience, 2008, 20(1): 61-74
CrossRef
Google scholar
|
[32] |
Li J, Furuse K, Yamaguchi K. Focused crawling by exploiting anchor text using decision tree. In: Proceedings of 14th International Conference on World Wide Web, 2005, 1190-1191
|
[33] |
Yu H, Zuo W, Peng T. A new PU learning algorithm for text classification. Lecture Notes in Computer Science, 2005, 3789: 824-832
CrossRef
Google scholar
|
/
〈 | 〉 |