Extracting Structured Data from Web Pages

cnugnu1522

Tags: database

Keywords: Automatic Data Extraction

Many web sites contain large sets of pages generated using a common template or layout. For example, Amazon lays out the author, title, comments, etc. in the same way in all its book pages. The values used to generate the pages (e.g., the author, title, ...) typically come from a database. In this paper, we study the problem of automatically extracting the database values from the web pages without any learning examples or other similar human input. We formally define the notion of a template, and propose a model that describes how values are encoded into pages using a template. We present an extraction algorithm that constructs the template from sets of words that have similar occurrence patterns in the input pages. The constructed template is then used to extract values from the pages. We show experimentally that the extracted values make semantic sense in most cases.
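
To make the idea of "words with a similar occurrence pattern" more concrete, here is a minimal Python sketch. It is not the algorithm from the paper, only a simplified illustration under stated assumptions: pages are split on whitespace, the occurrence pattern of a token is taken to be its count in each page, and a token is treated as part of the template if it appears in every page. The function names (`occurrence_classes`, `template_tokens`, `extract_values`) and the sample pages are hypothetical.

```python
from collections import Counter, defaultdict


def occurrence_classes(pages):
    """Group tokens by occurrence vector (the token's count in each page)."""
    counts = [Counter(page.split()) for page in pages]
    vocabulary = set().union(*counts)
    classes = defaultdict(list)
    for token in sorted(vocabulary):
        vector = tuple(c[token] for c in counts)
        classes[vector].append(token)
    return dict(classes)


def template_tokens(pages):
    """Assumed heuristic: a token that occurs in every page belongs to the template."""
    template = set()
    for vector, tokens in occurrence_classes(pages).items():
        if min(vector) > 0:
            template.update(tokens)
    return template


def extract_values(page, template):
    """Return the runs of non-template tokens found between template tokens."""
    values, current = [], []
    for token in page.split():
        if token in template:
            if current:
                values.append(" ".join(current))
                current = []
        else:
            current.append(token)
    if current:
        values.append(" ".join(current))
    return values


# Hypothetical sample pages that share one layout.
pages = [
    "<b>Title:</b> Dune <b>Author:</b> Frank Herbert",
    "<b>Title:</b> Emma <b>Author:</b> Jane Austen",
]
template = template_tokens(pages)
for page in pages:
    print(extract_values(page, template))
```

In this toy input the two label tokens occur once in each page, so they share an occurrence vector and are treated as the template, while the remaining tokens are returned as values. Real pages would need far more care, for example when a data value happens to appear in every page or when a template token repeats a varying number of times.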

For more information, please visit our website: http://www.knowlesys.com

From the ITPUB blog: http://blog.itpub.net/14387573/viewspace-343224/. If you repost this article, please credit the source; otherwise legal action will be pursued.
