Wrapper Definition

Wrappers are specialised program routines that automatically extract data from Internet websites and convert the information into a structured format. More specifically, wrappers have three main functions. Firstly, they must be able to download HTML pages from a website. Secondly, they must search for, recognise and extract specified data. Thirdly, they must save this data in a suitably structured format to enable further manipulation [6]. The data can then be imported into other applications for additional processing. According to [20], over 80% of the published information on the WWW is based on databases running in the background. When this data is compiled into HTML documents, the structure of the underlying databases is completely lost. Wrappers try to reverse this process by restoring the information to a structured format [21]. With the right programs, it is even possible to use the WWW as a large database: by using several wrappers to extract data from the various information sources of the WWW, the retrieved data can be made available in an appropriately structured format [4].
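
The three functions listed above can be sketched in a few lines of Python. This is only a minimal illustration, not the method of any cited work: the function names, the sample table markup and the CSV output format are all assumptions made for the example, and the download step is defined but not called so that the sketch runs without network access.

```python
import csv
import io
import re
from urllib.request import urlopen


def download(url: str) -> str:
    """Function 1: fetch the raw HTML of a page (defined here, not called below)."""
    with urlopen(url) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


# Hypothetical page layout: one table row per product.
ROW_PATTERN = re.compile(
    r'<tr>\s*<td class="name">(?P<name>.*?)</td>\s*'
    r'<td class="price">(?P<price>.*?)</td>\s*</tr>',
    re.DOTALL,
)


def extract(html: str) -> list[dict[str, str]]:
    """Function 2: search for, recognise and extract the specified data."""
    return [match.groupdict() for match in ROW_PATTERN.finditer(html)]


def save(records: list[dict[str, str]], out) -> None:
    """Function 3: store the data in a structured format (CSV in this sketch)."""
    writer = csv.DictWriter(out, fieldnames=["name", "price"])
    writer.writeheader()
    writer.writerows(records)


# Made-up sample page standing in for a downloaded document.
SAMPLE_HTML = """
<table>
  <tr><td class="name">Widget</td><td class="price">9.99</td></tr>
  <tr><td class="name">Gadget</td><td class="price">19.50</td></tr>
</table>
"""

records = extract(SAMPLE_HTML)
buffer = io.StringIO()
save(records, buffer)
print(buffer.getvalue())
```

In real use, the string returned by download() would take the place of SAMPLE_HTML, and the output would be written to a file rather than an in-memory buffer.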

As a rule, a specially developed wrapper is required for each individual data source, because every website has its own unique structure. The WWW is also extremely dynamic and continually evolving, which results in frequent changes in the structures of websites. Consequently, existing wrappers often have to be updated or even completely rewritten in order to maintain the desired data extraction capabilities [1]. The Extensible Markup Language (XML) has the potential to alleviate such problems. Whereas HTML is presentation oriented, XML keeps the data structure separate from the presentation. However, it may take some time before all data is provided in XML format, and it remains to be seen whether XML can establish itself in all areas of electronic information processing [11]. Moreover, because XML documents are based on varying Document Type Definitions (DTDs) or XML Schemas, XML can reduce the current problems of data extraction from HTML documents but cannot completely resolve them. Wrappers will, therefore, retain an important role in the integration of data from WWW sources for some time to come.
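
The maintenance problem described above, where a change in a website's structure breaks an existing wrapper, can be illustrated with a small Python sketch. Both page snippets and the extraction pattern are invented for the example; they merely stand for "the same data before and after a redesign".

```python
import re

# Pattern written against the original (hypothetical) table layout.
PRICE_PATTERN = re.compile(r'<td class="price">(.*?)</td>')

# The same data before and after an assumed site redesign.
OLD_PAGE = '<tr><td class="name">Widget</td><td class="price">9.99</td></tr>'
NEW_PAGE = (
    '<div class="item"><span class="name">Widget</span>'
    '<span class="cost">9.99</span></div>'
)


def extract_prices(html: str) -> list[str]:
    """Return every price the pattern can find in the page."""
    return PRICE_PATTERN.findall(html)


old_result = extract_prices(OLD_PAGE)  # the wrapper finds the price
new_result = extract_prices(NEW_PAGE)  # the wrapper silently finds nothing

print(old_result, new_result)
```

The data itself has not changed, only its presentation, yet the wrapper has to be rewritten. Note also that the failure is silent: the wrapper returns an empty result rather than raising an error.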

Wrapper-Generating Toolkits
Every wrapper can be manually developed from scratch, for example, in an established programming language using regular expressions. For smaller applications, this can prove to be a sensible approach. However, if the use of a larger number of wrappers is required, this inevitably leads to the use of so-called toolkits, which can generate a complete wrapper based on user defined parameters for a given data source. One of the most important features of generated wrappers is the format in which the extracted data can be exported. If, for example, the extracted data is converted into an XML format, then it can be imported and processed by a large number of software applications. Toolkits for generating wrappers can be differentiated in a number of ways. They can be categorised by their output methods, interface type, Web crawling capability, use of a graphical user interface (GUI) and several other characteristics. Laender et al. [12] categorise a number of toolkits based on the methods used for generating wrappers. These methods include specially designed wrapper development languages and algorithms based on HTML-awareness, induction, modelling, ontology and natural language processing. However, a detailed presentation of such technical details is beyond the scope of this survey paper. Therefore, the toolkits are simply divided into two basic categories based on commercial and non-commercial availability.
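
To make the point about export formats concrete, the following Python sketch converts already extracted records into XML and reads them back, much as a consuming application might. The record contents and the element names (products, product) are assumptions chosen for the example; no particular toolkit is claimed to produce this exact layout.

```python
import xml.etree.ElementTree as ET


def records_to_xml(records, root_tag="products", item_tag="product") -> str:
    """Serialise a list of flat dictionaries as an XML document string."""
    root = ET.Element(root_tag)
    for record in records:
        item = ET.SubElement(root, item_tag)
        for field, value in record.items():
            # Each dictionary key becomes a child element holding the value.
            ET.SubElement(item, field).text = value
    return ET.tostring(root, encoding="unicode")


def xml_to_records(xml_text: str) -> list[dict[str, str]]:
    """Parse the XML again, as an importing application would."""
    root = ET.fromstring(xml_text)
    return [{child.tag: child.text for child in item} for item in root]


# Made-up records standing in for the output of a wrapper.
extracted = [
    {"name": "Widget", "price": "9.99"},
    {"name": "Gadget & Co", "price": "19.50"},
]

xml_text = records_to_xml(extracted)
round_trip = xml_to_records(xml_text)

print(xml_text)
```

Because the XML library handles escaping (the ampersand in the second record, for instance), the consumer can recover the records without any knowledge of the original HTML page.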

The wrapper-generating programs within both of these categories offer several different means of user interaction. Some toolkits are purely command-line based and require routines written in a toolkit-specific scripting language in order to generate an appropriate wrapper for a specified data source. These wrapper development scripting languages are used in standard text editors and can be seen as application-specific alternatives to general-purpose languages such as Perl and Java. A large number of toolkits offer a GUI, whereby the relevant data within an HTML document is highlighted with a mouse, and the program then generates a wrapper based on the specified information. Several toolkits combine both of the features described above. Initially, the relevant data is highlighted with a mouse and the program generates a wrapper from this input. If the automatically generated result does not meet the specified requirements, the user can additionally make changes via an editor integrated within the toolkit. Whether frequent corrections are necessary depends largely on the underlying algorithms and the functional maturity of the toolkit.
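
As a rough idea of how a wrapper might be generated from a single highlighted example, the Python sketch below takes the text a user has marked, looks at the characters immediately to its left and right, and turns them into an extraction pattern. This is a deliberately naive toy and an assumption of this example, not the algorithm of any toolkit mentioned here; real toolkits rely on considerably more robust methods. The function name, the sample page and the context width are all hypothetical.

```python
import re


def induce_pattern(html: str, example: str, context: int) -> re.Pattern:
    """Build a regex from the text surrounding one highlighted example."""
    start = html.index(example)  # position of the highlighted text
    end = start + len(example)
    left = html[max(0, start - context):start]  # delimiter before the value
    right = html[end:end + context]  # delimiter after the value
    return re.compile(re.escape(left) + r"(.*?)" + re.escape(right), re.DOTALL)


# Made-up page; the user is assumed to have highlighted "Widget".
PAGE = (
    "<li><b>Widget</b> <i>9.99</i></li>"
    "<li><b>Gadget</b> <i>19.50</i></li>"
)

# A context of 7 characters captures "<li><b>" on the left and "</b> <i" on the right.
pattern = induce_pattern(PAGE, "Widget", context=7)
names = pattern.findall(PAGE)

print(names)
```

If the context is too wide, the pattern also captures data specific to the first record and no longer matches the others. This is one reason why an automatically generated wrapper may need manual correction in an editor.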


Source: "ITPUB Blog", link: http://blog.itpub.net/14387573/viewspace-343213/. If you wish to repost, please credit the source; otherwise legal liability will be pursued.
