A focused crawling strategy based on comprehensive priority evaluation of hyperlinks and improved Bayesian classifier

Jingfa LIU, Yongchuang WU, Zhaoxia LIU

Front Inform Technol Electron Eng, 2025, Vol. 26, Issue 12: 2569-2582. DOI: 10.1631/FITEE.2400939
Research Article


Abstract

Avoiding topic drift and crossing tunnels are the two main difficulties in focused crawling. To address topic drift, we design a comprehensive priority evaluation (CPE) method based on the web text, anchor text, and context of hyperlinks, which improves the topic-relevance evaluation of unvisited hyperlinks. We then propose an improved Bayesian classifier with weights (BCW), which adds label weights to the feature words of the Bayesian classifier to enhance the accuracy of webpage classification. To cross tunnels, through which topic-relevant webpages can be reached from low-relevance webpages, we construct a content block segmentation (CBS) technique for webpages based on the backtracking method. It segments a webpage into multiple blocks, judges the relevance of every content block, and extracts the hyperlinks with high comprehensive relevance. Finally, we propose a BCW-based focused crawling strategy that combines the CPE and CBS strategies (BCW_CC) and evaluate it experimentally in two domains: rainstorm disasters and sports. The results demonstrate the effectiveness of the BCW_CC method.

Keywords

Focused crawler (FC) / Bayesian classifier / Information retrieval / Priority evaluation
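
The abstract names three signals for scoring an unvisited hyperlink (web text, anchor text, and hyperlink context) but does not give the scoring formula. As a rough illustration only, the sketch below combines the three signals as a weighted sum of cosine similarities against a topic description. The function names, the bag-of-words representation, the cosine measure, and the weights (0.4, 0.3, 0.3) are all assumptions made for this example and are not taken from the paper.

```python
import math
import re
from collections import Counter


def term_vector(text):
    """Build a lowercased bag-of-words term-frequency vector."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def link_priority(topic, page_text, anchor_text, context,
                  weights=(0.4, 0.3, 0.3)):
    """Hypothetical comprehensive priority of one hyperlink.

    Weighted sum of the topic similarity of three signals: the text of
    the page containing the link, the anchor text, and the text around
    the link. The weights are illustrative, not values from the paper.
    """
    topic_vec = term_vector(topic)
    signals = (page_text, anchor_text, context)
    return sum(
        w * cosine(topic_vec, term_vector(s))
        for w, s in zip(weights, signals)
    )


topic = "rainstorm flood disaster"
relevant = link_priority(
    topic,
    "heavy rainstorm caused flood in the city",
    "rainstorm disaster report",
    "flood warning issued",
)
unrelated = link_priority(
    topic,
    "football match results",
    "league table",
    "transfer news",
)
print(relevant > unrelated)
```

In a crawler built along these lines, the score would typically order the frontier of unvisited URLs, so that links with higher estimated topic relevance are fetched first.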

Cite this article

Jingfa LIU, Yongchuang WU, Zhaoxia LIU. A focused crawling strategy based on comprehensive priority evaluation of hyperlinks and improved Bayesian classifier. Front Inform Technol Electron Eng, 2025, 26(12): 2569-2582. DOI: 10.1631/FITEE.2400939


RIGHTS & PERMISSIONS

Zhejiang University Press
