A focused crawling strategy based on comprehensive priority evaluation of hyperlinks and improved Bayesian classifier
Jingfa LIU, Yongchuang WU, Zhaoxia LIU
Eng Inform Technol Electron Eng, 2025, Vol. 26, Issue 12: 2569-2582.
Avoiding topic drift and crossing tunnels are two main difficulties in focused crawling. To overcome topic drift, we design a comprehensive priority evaluation (CPE) method based on the web text, anchor text, and context of hyperlinks, which improves the topic-relevance evaluation of unvisited hyperlinks. We then propose an improved Bayesian classifier with weights (BCW), which adds label weights to the feature words of the Bayesian classifier to improve the accuracy of webpage classification. To cross tunnels, through which topic-relevant webpages can be reached only via low-relevance webpages, we develop a content block segmentation (CBS) technique based on the backtracking method. CBS segments a webpage into multiple content blocks, judges the relevance of each block, and extracts the hyperlinks with high comprehensive relevance. Finally, we propose a BCW-based focused crawling strategy that combines CPE and CBS (BCW_CC) and evaluate it experimentally in two domains: rainstorm disasters and sports. The results demonstrate the effectiveness of the BCW_CC method.
Focused crawler (FC) / Bayesian classifier / Information retrieval / Priority evaluation
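The abstract does not give the CPE formula, so the following is only a minimal sketch of the general idea: score an unvisited hyperlink from the web text, anchor text, and hyperlink context, then crawl the highest-scoring links first. The cosine similarity over term counts, the weights `(0.3, 0.4, 0.3)`, and the names `cosine`, `link_priority`, and `Frontier` are all illustrative assumptions, not the authors' method.

```python
import heapq
import math
from collections import Counter


def cosine(a: str, b: str) -> float:
    """Cosine similarity between whitespace-tokenized term-count vectors."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[t] * vb[t] for t in va.keys() & vb.keys())
    norm = (math.sqrt(sum(c * c for c in va.values()))
            * math.sqrt(sum(c * c for c in vb.values())))
    return dot / norm if norm else 0.0


def link_priority(topic, page_text, anchor_text, context_text,
                  weights=(0.3, 0.4, 0.3)):
    """Weighted sum of three relevance signals (weights are assumed values)."""
    w_page, w_anchor, w_context = weights
    return (w_page * cosine(topic, page_text)
            + w_anchor * cosine(topic, anchor_text)
            + w_context * cosine(topic, context_text))


class Frontier:
    """Max-priority queue of unvisited URLs; each URL is enqueued at most once."""

    def __init__(self):
        self._heap = []
        self._seen = set()

    def push(self, url, priority):
        if url not in self._seen:
            self._seen.add(url)
            # heapq is a min-heap, so store the negated priority.
            heapq.heappush(self._heap, (-priority, url))

    def pop(self):
        neg_priority, url = heapq.heappop(self._heap)
        return url, -neg_priority
```

In this sketch a crawler would call `link_priority` for each hyperlink extracted from a page, push the result into the `Frontier`, and repeatedly pop the most promising URL to visit next.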
Zhejiang University Press