Method of Web Page Text Extraction Based on Text Feature and Page Structure

Journal of Atmospheric and Environmental Optics, 2017, Vol. 12, Issue (3): 230-235.

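HU Lulu¹, LIU Xiaoqin¹, SUN Kai²

(1 Key Laboratory of Atmospheric Composition and Optical Radiation, Anhui Institute of Optics and Fine Mechanics, Chinese Academy of Sciences, Hefei 230031, Anhui, China;
2 Department of Automation, University of Science and Technology of China, Hefei 230026, Anhui, China)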

Received: 2016-01-12; Revised: 2016-02-17; Published: 2017-05-28; Online: 2017-05-18
Corresponding author: HU Lulu (born 1991), female, from Guoyang, Anhui; master's student whose research focuses on computer applications. E-mail: hllyyy@mail.ustc.edu.cn
Funding: Supported by the Strategic Priority Research Program of the Chinese Academy of Sciences (XDB05040300)

Abstract

Web information extraction has long been an active research topic in information technology, and in recent years the DIV+CSS layout method has come into widespread use in web page design. Building on these two observations, a simple and practical method is proposed for extracting the main text of news web pages based on text features and page structure. The method first identifies and extracts the content block that holds the main text, then uses regular expressions to filter out the HTML tags in that block and extract the main text. Experimental results show that the method achieves high generality and accuracy in main-text extraction.

Key words: information extraction, text features, page structure, text content block, regular expressions
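
The two steps named in the abstract (locate the content block, then strip its HTML tags with regular expressions) can be illustrated with a minimal Python sketch. The abstract does not say which text features the authors use to identify the content block, so the scoring rule below (punctuation count weighted by the text-to-markup ratio of each `<div>`) is an assumption made for illustration only. The names `div_blocks`, `strip_tags`, `score`, and `extract_main_text` are likewise hypothetical and do not come from the paper.

```python
import html
import re

# Assumed text feature: sentence punctuation (Chinese and ASCII) is dense in
# body text and sparse in navigation bars and footers.
PUNCT = re.compile(r"[，。！？；、,.!?;]")
DIV_TAG = re.compile(r"(?i)<(/?)div\b[^>]*>")


def div_blocks(page_html):
    """Return the inner HTML of every <div>, matching open/close tags by depth."""
    stack, blocks = [], []
    for m in DIV_TAG.finditer(page_html):
        if m.group(1) == "":          # opening <div ...>
            stack.append(m.end())
        elif stack:                   # closing </div> with a matching opener
            start = stack.pop()
            blocks.append(page_html[start:m.start()])
    return blocks


def strip_tags(block_html):
    """Remove scripts, styles, and tags with regular expressions; tidy whitespace."""
    no_scripts = re.sub(r"(?is)<(script|style)\b.*?</\1\s*>", " ", block_html)
    no_tags = re.sub(r"(?s)<[^>]+>", " ", no_scripts)
    text = html.unescape(no_tags)
    return re.sub(r"\s+", " ", text).strip()


def score(block_html):
    """Hypothetical score: punctuation count times the text-to-markup ratio."""
    text = strip_tags(block_html)
    if not text:
        return 0.0
    return len(PUNCT.findall(text)) * len(text) / len(block_html)


def extract_main_text(page_html):
    """Pick the best-scoring <div> and return its text without HTML tags."""
    blocks = div_blocks(page_html)
    if not blocks:                    # page has no <div>: fall back to whole page
        return strip_tags(page_html)
    return strip_tags(max(blocks, key=score))
```

Weighting by the text-to-markup ratio keeps an outer wrapper `<div>`, which contains every sentence on the page plus all the navigation markup, from outscoring the inner block that holds only the article. The tag-stripping pattern is deliberately simple and would also remove literal text that happens to sit between `<` and `>`.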

CLC number: TP391

Cite this article: HU Lulu, LIU Xiaoqin, SUN Kai. Method of Web Page Text Extraction Based on Text Feature and Page Structure[J]. Journal of Atmospheric and Environmental Optics, 2017, 12(3): 230-235.

