Relevance-based content extraction of HTML documents

Qi Wu, Xing-shu Chen, Kai Zhu, Chun-hui Wang

Journal of Central South University, 2012, Vol. 19, Issue 7: 1921-1926.

DOI: 10.1007/s11771-012-1226-8

Abstract

Content extraction from HTML pages underpins web page clustering and information retrieval, so cluttered information must be eliminated and page content extracted accurately. A novel and accurate method for extracting the content of HTML pages is proposed. First, the HTML page is parsed into a DOM object and IDs are generated for all leaf nodes. Second, a score is calculated for each leaf node and then adjusted according to the node's relationship with its neighbors. Finally, information blocks are identified according to their definition, and a universal classification algorithm is used to identify the content blocks among them. Experimental results show that the algorithm extracts content effectively and accurately, with recall and precision of 96.5% and 93.8%, respectively.
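
The three steps in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not give the ID scheme, the scoring formula, the neighbor adjustment, the block definition, or the classifier, so every one of those below is an assumption chosen for illustration. Leaf IDs are dot-separated child-index paths, the raw score is a word count discounted inside links, relevance between two leaves is the shared fraction of their ID paths, blocks are runs of adjacent leaves with high relevance, and a simple mean cutoff stands in for the classification algorithm. All function names and parameters (`LeafCollector`, `relevance`, `extract_content`, `link_weight`, `neighbor_weight`, `same_block`) are hypothetical.

```python
from html.parser import HTMLParser

VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}
SKIP = {"script", "style"}


class LeafCollector(HTMLParser):
    """Collects text leaves with a path-style ID such as '0.0.1.0.0' (assumed scheme)."""

    def __init__(self):
        super().__init__()
        self.path = []       # child indices from the root to the open element
        self.counters = [0]  # next child index at each open depth
        self.tags = []       # names of the currently open elements
        self.leaves = []     # (leaf_id, text, inside_link)

    def _next_index(self):
        idx = self.counters[-1]
        self.counters[-1] += 1
        return idx

    def handle_starttag(self, tag, attrs):
        idx = self._next_index()
        if tag in VOID:      # void elements never hold text leaves
            return
        self.path.append(idx)
        self.counters.append(0)
        self.tags.append(tag)

    def handle_endtag(self, tag):
        if tag not in self.tags:
            return
        while self.tags:     # also closes any unclosed descendants
            closed = self.tags.pop()
            self.path.pop()
            self.counters.pop()
            if closed == tag:
                break

    def handle_data(self, data):
        text = " ".join(data.split())
        if not text or any(t in SKIP for t in self.tags):
            return
        leaf_id = ".".join(map(str, self.path + [self._next_index()]))
        self.leaves.append((leaf_id, text, "a" in self.tags))


def relevance(id_a, id_b):
    """Shared ID prefix length divided by the longer ID length (assumed measure)."""
    a, b = id_a.split("."), id_b.split(".")
    shared = 0
    for x, y in zip(a, b):
        if x != y:
            break
        shared += 1
    return shared / max(len(a), len(b))


def extract_content(html, link_weight=0.2, neighbor_weight=0.5, same_block=0.5):
    parser = LeafCollector()
    parser.feed(html)
    parser.close()
    leaves = parser.leaves
    if not leaves:
        return []
    ids = [leaf_id for leaf_id, _, _ in leaves]
    # Step 2a: raw score = word count, discounted for link text.
    raw = [len(text.split()) * (link_weight if in_link else 1.0)
           for _, text, in_link in leaves]
    # Step 2b: adjust each score by its two neighbors, weighted by relevance.
    adjusted = []
    for i, score in enumerate(raw):
        for j in (i - 1, i + 1):
            if 0 <= j < len(raw):
                score += neighbor_weight * relevance(ids[i], ids[j]) * raw[j]
        adjusted.append(score)
    # Step 3a: a block is a run of adjacent leaves with high mutual relevance.
    blocks = [[0]]
    for i in range(1, len(leaves)):
        if relevance(ids[i - 1], ids[i]) >= same_block:
            blocks[-1].append(i)
        else:
            blocks.append([i])
    # Step 3b: stand-in classifier, keep blocks scoring at least the mean.
    totals = [sum(adjusted[i] for i in block) for block in blocks]
    cutoff = sum(totals) / len(totals)
    return [" ".join(leaves[i][1] for i in block)
            for block, total in zip(blocks, totals) if total >= cutoff]


SAMPLE = """<html><body>
<div id="nav"><a href="/">Home</a><a href="/news">News</a></div>
<div id="main"><p>Content extraction removes clutter such as menus and advertisements from a page.</p><p>The remaining text is what clustering and retrieval systems actually need.</p></div>
<div id="foot"><a href="/about">About</a></div>
</body></html>"""

content_blocks = extract_content(SAMPLE)
```

On the sample page, the navigation and footer links form low-scoring blocks, while the two paragraphs form one high-scoring block. A real system would replace the mean cutoff with a trained or clustering-based classifier and would tune the weights on labeled pages.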

Keywords

content extraction / DOM / node / relevance / information block

Cite this article

Qi Wu, Xing-shu Chen, Kai Zhu, Chun-hui Wang. Relevance-based content extraction of HTML documents. Journal of Central South University, 2012, 19(7): 1921-1926. DOI: 10.1007/s11771-012-1226-8

