Journal of Atmospheric and Environmental Optics, 2017, Vol. 12, Issue (3): 230-235.


Method of Web Page Text Extraction Based on Text Feature and Page Structure

HU Lulu1, LIU Xiaoqin1, SUN Kai2   

  1. Key Laboratory of Atmospheric Composition and Optical Radiation, Anhui Institute of Optics and Fine Mechanics, Chinese Academy of Sciences, Hefei 230031, Anhui, China
  2. Department of Automation, University of Science and Technology of China, Hefei 230026, Anhui, China
Received: 2016-01-12; Revised: 2016-02-17; Online: 2017-05-28; Published: 2017-05-18

Abstract:

Web information extraction has long been an active research topic in information technology, and in recent years the DIV + CSS layout method has become common in web design. Building on this, a simple and practical method for extracting the main text of news web pages is presented, based on text features and page structure. The text content block of the page is first identified and extracted; regular expressions are then used to filter the HTML tags out of the content block, leaving the main text of the web page. Experimental results show that the method generalizes well and achieves a high accuracy rate in text extraction.

Key words: information extraction, text features, page structure, text content block, regular expressions
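
The two steps described in the abstract (locate the text content block, then strip its HTML tags with regular expressions) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the abstract does not say which text features are used, so the scoring rule here (length of non-link text per DIV block) is an assumption, and the names `DivBlockCollector`, `strip_tags`, `score`, and `extract_main_text` are hypothetical.

```python
import html
import re
from html.parser import HTMLParser

# Regular expressions used for filtering (illustrative, not from the paper).
SCRIPT_STYLE_RE = re.compile(r"(?is)<(script|style)\b.*?</\1\s*>")
LINK_RE = re.compile(r"(?is)<a\b.*?</a>")
TAG_RE = re.compile(r"<[^>]+>")


class DivBlockCollector(HTMLParser):
    """Collects the raw HTML owned by each <div>, excluding nested <div>s."""

    def __init__(self):
        super().__init__(convert_charrefs=False)
        self.stack = []   # one buffer per currently open <div>
        self.blocks = []  # raw HTML of every closed <div>

    def handle_starttag(self, tag, attrs):
        if tag == "div":
            self.stack.append([])
        elif self.stack:
            self.stack[-1].append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag == "div":
            if self.stack:
                self.blocks.append("".join(self.stack.pop()))
        elif self.stack:
            self.stack[-1].append(f"</{tag}>")

    def handle_data(self, data):
        if self.stack:
            self.stack[-1].append(data)

    def handle_entityref(self, name):
        if self.stack:
            self.stack[-1].append(f"&{name};")

    def handle_charref(self, name):
        if self.stack:
            self.stack[-1].append(f"&#{name};")


def strip_tags(fragment):
    """Step 2: remove HTML tags with a regex and normalise whitespace."""
    text = TAG_RE.sub(" ", fragment)
    return re.sub(r"\s+", " ", html.unescape(text)).strip()


def score(block):
    """Assumed text feature: amount of text that is not inside links."""
    return len(strip_tags(LINK_RE.sub("", block)))


def extract_main_text(page):
    """Step 1: pick the highest-scoring DIV block; step 2: strip its tags."""
    page = SCRIPT_STYLE_RE.sub("", page)  # drop scripts and styles first
    collector = DivBlockCollector()
    collector.feed(page)
    collector.close()
    if not collector.blocks:
        return ""
    return strip_tags(max(collector.blocks, key=score))
```

Calling `extract_main_text(page_source)` on the HTML of a news page returns the plain text of the DIV block with the most non-link text. Navigation bars and footers tend to score low under this assumed rule because most of their text sits inside links or is very short.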
