Screen scraping

最新推荐文章于 2024-10-04 06:11:34 发布

weixin_30449239

最新推荐文章于 2024-10-04 06:11:34 发布

阅读量199

点赞数

文章标签： xhtml

原文链接：http://www.cnblogs.com/faunus/archive/2009/04/11/1433564.html

版权

From Wikipedia, the free encyclopedia

http://en.wikipedia.org/wiki/Screen_scraping

Screen scraping is a technique in which a computer program extracts data from the display output of another program. The program doing the scraping is called a screen scraper. The key element that distinguishes screen scraping from regular parsing is that the output being scraped was intended for display to an end-user, rather than as input to another program, and is therefore usually neither documented nor structured for convenient parsing. Screen scraping often involves ignoring binary data (usually images or multimedia data) and formatting elements that obscure the essential, desired text data. Optical character recognition (OCR) software is a kind of visual scraper.

Description

Normally, data transfer between programs is accomplished using data structures suited for automated processing by computers, not people. Such interchange formats and protocols are typically rigidly structured, well-documented, easily parsed, and keep ambiguity to a minimum. Very often, these transmissions are not human-readable at all.

In contrast, output intended to be human-readable is often the antithesis of this, with display formatting, redundant labels, superfluous commentary, and other information which is either irrelevant or inimical to automated processing. However, when the only output available is such a human-oriented display, screen scraping becomes the only automated way of accomplishing a data transfer.

Screen scraping is most often done to either (1) interface to a legacy system which has no other mechanism which is compatible with current hardware, or (2) interface to a third-party system which does not provide a more convenient API. In the second case, the operator of the third-party system may even see screen scraping as unwanted, due to reasons such as increased system load, the loss of advertisement revenue, or the loss of control of the information content.

Screen scraping is generally considered an ad-hoc, inelegant technique, often used only as a "last resort" when no other mechanism is available. Aside from the higher programming and processing overhead, output displays intended for human consumption often change structure frequently. Humans can cope with this easily, but computer programs will often crash or produce incorrect results.

[edit] Etymology

Originally, screen scraping referred to the practice of reading text data from a computer display terminal's screen. This was generally done by reading the terminal's memory through its auxiliary port, or by connecting the terminal output port of one computer system to an input port on another.

As a concrete example of a classic screen scraper, consider a hypothetical legacy system dating from the 1960s -- the dawn of computerized data processing. Computer to user interfaces from that era were often simply text-based dumb terminals which were not much more than virtual teleprinters. (Such systems are still in use today, for various reasons.) The desire to interface such a system to more modern systems is common. An elegant solution will often require things no longer available, such as source code, system documentation, APIs, and/or programmers with experience in a 45 year old computer system. In such cases, the only feasible solution may be to write a screen scraper which "pretends" to be a user at a terminal. The screen scraper might connect to the legacy system via Telnet, emulate the keystrokes needed to navigate the old user interface, process the resulting display output, extract the desired data, and pass it on to the modern system.

In the 1980's financial data providers such as Reuters, Telerate, and Quotron displayed data in 24x80 format intended for a human reader. Users of this data particularly investment banks wrote applications to capture and convert this character data as numeric data for inclusion into calculations for trading decisions without re-keying the data. The common term for this practice, especially in the United Kingdom, was page shredding, since the results could be imagined to have passed through a paper shredder.

[edit] Web scraping

Main article: Web scraping

Web pages are built using text-based mark-up languages (HTML and XHTML), and frequently contain a wealth of useful data in text form. However, most web pages are designed for human end-users and not for ease of automated use. Because of this, tool kits that scrape web content were created. A web scraper bears little in common with a screen scraper, and is nothing more than an API to extract data from a web site.