Method of Web Page Text Extraction Based on Text Feature and Page Structure

Journal of Atmospheric and Environmental Optics ›› 2017, Vol. 12 ›› Issue (3): 230-235.

HU Lulu¹, LIU Xiaoqin¹, SUN Kai²

(1 Key Laboratory of Atmospheric Composition and Optical Radiation, Anhui Institute of Optics and Fine Mechanics, Chinese Academy of Sciences, Hefei 230031, Anhui, China;
2 Department of Automation, University of Science and Technology of China, Hefei 230026, Anhui, China)

Received: 2016-01-12  Revised: 2016-02-17  Online: 2017-05-28  Published: 2017-05-18

Abstract

Web information extraction has long been an active topic in information technology, and in recent years the DIV + CSS layout has become common in web page design. Building on this, a simple and practical method for extracting the main text of news web pages, based on text features and page structure, is presented. The method first identifies and extracts the text content block of the page, then uses regular expressions to filter the HTML tags out of that block, leaving the main text of the page. Experimental results show that the method generalizes well across pages and achieves high accuracy in text extraction.
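
The two steps described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not say which text features are used, so the punctuation-count score, the exclusion of link text, and the names `BlockCollector`, `score`, and `extract_main_text` are all assumptions made for illustration. The sketch credits text to the innermost enclosing `<div>`, picks the highest-scoring block, and then removes the remaining tags with a regular expression.

```python
import re
from html.parser import HTMLParser

# Assumed text feature: sentence punctuation (Western and Chinese).
PUNCT = re.compile(r"[,.;!?，。；！？]")
# Anchor elements, removed before scoring so link lists score low.
LINK = re.compile(r"<a\b[^>]*>.*?</a>", re.S | re.I)
# Any remaining HTML tag.
TAG = re.compile(r"<[^>]+>")


class BlockCollector(HTMLParser):
    """Collect the markup of each <div>, crediting content to the innermost one."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.stack = []   # open <div> blocks, innermost last
        self.blocks = []  # markup of every closed <div>
        self.skip = 0     # depth inside <script>/<style>

    def handle_starttag(self, tag, attrs):
        if tag == "div":
            self.stack.append([])
        elif tag in ("script", "style"):
            self.skip += 1
        elif self.stack:
            self.stack[-1].append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag == "div":
            if self.stack:
                self.blocks.append("".join(self.stack.pop()))
        elif tag in ("script", "style"):
            self.skip = max(0, self.skip - 1)
        elif self.stack:
            self.stack[-1].append(f"</{tag}>")

    def handle_data(self, data):
        # Ignore script/style bodies and text outside any <div>.
        if self.stack and not self.skip:
            self.stack[-1].append(data)


def score(markup):
    """Assumed feature: punctuation marks in the block's non-link text."""
    text = TAG.sub(" ", LINK.sub(" ", markup))
    return len(PUNCT.findall(text))


def extract_main_text(html):
    """Step 1: pick the best-scoring block. Step 2: filter its tags by regex."""
    collector = BlockCollector()
    collector.feed(html)
    collector.close()
    if not collector.blocks:
        return ""
    best = max(collector.blocks, key=score)
    text = TAG.sub(" ", best)
    return re.sub(r"\s+", " ", text).strip()
```

Calling `extract_main_text(page_html)` on a news page laid out with nested `<div>` elements returns the plain text of the block with the most sentence punctuation. A real system would likely combine several features (text length, link density, tag density) rather than this single assumed one.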

Key words: information extraction, text features, page structure, text content block, regular expressions

CLC Number: TP391

HU Lu-Lu, LIU Xiao-Qin, SUN Kai. Method of Web Page Text Extraction Based on Text Feature and Page Structure[J]. Journal of Atmospheric and Environmental Optics, 2017, 12(3): 230-235.

