Design and Implementation of a High-Performance Distributed Web Crawler (Part 2)

1.1  Crawling Applications

There are a number of different scenarios in which crawlers are used for data acquisition. We now describe a few examples and how they differ in the crawling strategies used.

Breadth-First Crawler: In order to build a major search engine or a large repository such as the Internet Archive [18], high-performance crawlers start out at a small set of pages and then explore other pages by following links in a “breadth first-like” fashion. In reality, the web pages are often not traversed in a strict breadth-first fashion, but using a variety of policies, e.g., for pruning crawls inside a web site, or for crawling more important pages first.
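
To make the interplay between the frontier and the “URL seen” structure concrete, here is a minimal Python sketch of a breadth-first crawl loop. The helpers `fetch_page` and `extract_links` are assumed to be supplied by the caller, and everything is kept in memory, unlike a real large-scale crawler.

```python
from collections import deque

def breadth_first_crawl(seed_urls, fetch_page, extract_links, max_pages=1000):
    """Minimal breadth-first crawl loop: a FIFO frontier plus a 'URL seen' set.

    fetch_page and extract_links are assumed caller-supplied callbacks; a real
    crawler would add politeness delays, robots.txt checks, and disk-backed
    data structures (see the later sections).
    """
    frontier = deque(seed_urls)          # FIFO queue -> breadth-first order
    seen = set(seed_urls)                # in-memory "URL seen" structure
    crawled = 0
    while frontier and crawled < max_pages:
        url = frontier.popleft()
        page = fetch_page(url)           # download; may fail or be skipped
        if page is None:
            continue
        crawled += 1
        for link in extract_links(page):
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return crawled
```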

Recrawling Pages for Updates: After pages are initially acquired, they may have to be periodically recrawled and checked for updates. In the simplest case, this could be done by starting another broad breadth-first crawl, or by simply requesting all URLs in the collection again.  However, a variety of heuristics can be employed to recrawl more important pages, sites, or domains more frequently. Good recrawling strategies are crucial for maintaining an up-to-date search index with limited crawling bandwidth, and recent work by Cho and Garcia-Molina [11, 10] has studied techniques for optimizing the “freshness” of such collections given observations about a page’s update history.
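
As a rough illustration of such a heuristic (not the optimized policies of [11, 10]), a recrawl queue can order pages by an estimated change rate weighted by importance and staleness. The scoring formula and field names below are assumptions made for the sketch.

```python
import heapq
import time

def recrawl_priority(last_crawl_ts, observed_changes, observed_crawls, importance=1.0):
    """Score a page for recrawling: pages that change often, are important,
    and have not been visited recently float to the top.  The formula is an
    illustrative heuristic only."""
    change_rate = observed_changes / max(observed_crawls, 1)   # crude estimate
    staleness = time.time() - last_crawl_ts                    # seconds since last visit
    return importance * change_rate * staleness

def build_recrawl_queue(pages):
    """pages: iterable of (url, last_crawl_ts, changes, crawls) tuples.
    Returns a priority heap of (negated score, url); heapq is a min-heap,
    so the most urgent page is popped first."""
    heap = []
    for url, ts, changes, crawls in pages:
        heapq.heappush(heap, (-recrawl_priority(ts, changes, crawls), url))
    return heap
```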

Focused Crawling: More specialized search engines may use crawling policies that attempt to focus only on certain types of pages, e.g., pages on a particular topic or in a particular language, images, mp3 files, or computer science research papers. In addition to heuristics, more general approaches have been proposed based on link structure analysis [9, 8] and machine learning techniques [13, 23]. The goal of a focused crawler is to find many pages of interest without using a lot of bandwidth. Thus, most of the previous work does not use a high-performance crawler, although doing so could support large specialized collections that are significantly more up-to-date than a broad search engine.
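
A minimal sketch of how a focused crawler might order its frontier, assuming a caller-supplied relevance score (e.g., from a topic classifier applied to anchor text or the parent page). The class and method names are hypothetical and this is not the system of any of the cited works.

```python
import heapq

class FocusedFrontier:
    """Best-first frontier for a focused crawler: URLs are popped in order of
    a relevance score supplied by the caller.  Illustrative sketch only."""

    def __init__(self):
        self._heap = []
        self._seen = set()

    def push(self, url, relevance):
        if url not in self._seen:
            self._seen.add(url)
            heapq.heappush(self._heap, (-relevance, url))  # max-heap via negation

    def pop(self):
        if not self._heap:
            return None
        _, url = heapq.heappop(self._heap)
        return url
```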

Random Walking and Sampling: Several techniques have been studied that use random walks on the web graph (or a slightly modified graph) to sample pages or estimate the size and quality of search engines [3, 15, 14].
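
The following sketch illustrates the basic idea of sampling pages by a random walk with restarts. `get_outlinks` is an assumed callback over an already-available link graph, and the restart probability is an arbitrary choice; this is only a sketch of the general idea behind [3, 15, 14].

```python
import random

def random_walk_sample(start_url, get_outlinks, steps=10000, restart_prob=0.15):
    """Sample pages by a random walk on the link graph.  get_outlinks(url) is
    an assumed callback returning that page's outlinks; restarting from the
    start page avoids getting stuck in dead ends."""
    visits = {}
    current = start_url
    for _ in range(steps):
        visits[current] = visits.get(current, 0) + 1
        outlinks = get_outlinks(current)
        if not outlinks or random.random() < restart_prob:
            current = start_url                      # restart the walk
        else:
            current = random.choice(outlinks)        # follow a random link
    return visits                                    # visit counts approximate sample weights
```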

Crawling the “Hidden Web”: A lot of the data accessible via the web actually resides in databases and can only be retrieved by posting appropriate queries and/or filling out forms on web pages. Recently, a lot of interest has focused on automatic access to this data, also called the “Hidden Web”, “Deep Web”, or “Federated Facts and Figures”. Work in [22] has looked at techniques for crawling this data. A crawler such as the one described here could be extended and used as an efficient front-end for such a system. We note, however, that there are many other challenges associated with access to the hidden web, and the efficiency of the front end is probably not the most important issue.

1.2  Basic Crawler Structure

Given these scenarios, we would like to design a flexible system that can be adapted to different applications and strategies with a reasonable amount of work. Note that there are significant differences between the scenarios. For example, a broad breadth-first crawler has to keep track of which pages have been crawled already; this is commonly done using a “URL seen” data structure that may have to reside on disk for large crawls. A link analysis-based focused crawler, on the other hand, may use an additional data structure to represent the graph structure of the crawled part of the web, and a classifier to judge the relevance of a page [9, 8], but the size of these structures may be much smaller. At the same time, there are a number of common tasks that need to be done in all or most scenarios, such as enforcement of robot exclusion, crawl speed control, and DNS resolution.

For simplicity, we separate our crawler design into two main components, referred to as crawling application and crawling system. The crawling application decides what page to request next given the current state and the previously crawled pages, and issues a stream of requests (URLs) to the crawling system. The crawling system (eventually) downloads the requested pages and supplies them to the crawling application for analysis and storage. The crawling system is in charge of tasks such as robot exclusion, speed control, and DNS resolution that are common to most scenarios, while the application implements crawling strategies such as “breadth-first” or “focused”. Thus, to implement a focused crawler instead of a breadth-first crawler, we would use the same crawling system (with a few different parameter settings) but a significantly different application component, written using a library of functions for common tasks such as parsing, maintenance of the “URL seen” structure, and communication with the crawling system and storage.
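
The sketch below illustrates this division of labor under simplifying assumptions: the application pushes URLs into the crawling system and receives downloaded pages through a callback. The class and method names (`CrawlingSystem`, `request`, `run_once`) are hypothetical, and the real system is asynchronous and distributed rather than a single synchronous loop.

```python
from dataclasses import dataclass
from typing import Callable, Iterable

@dataclass
class FetchedPage:
    url: str
    content: bytes

class CrawlingSystem:
    """Downloader side of the split: accepts a stream of URL requests and
    hands downloaded pages back to the application via a callback.  Names are
    hypothetical; the real system also handles robot exclusion, speed control,
    and DNS resolution."""

    def __init__(self, on_page: Callable[[FetchedPage], None], fetch: Callable[[str], bytes]):
        self._on_page = on_page
        self._fetch = fetch
        self._queue = []

    def request(self, urls: Iterable[str]) -> None:
        self._queue.extend(urls)          # the application pushes URLs here

    def run_once(self) -> None:
        while self._queue:
            url = self._queue.pop(0)
            try:
                content = self._fetch(url)
            except Exception:
                continue                   # tolerate odd servers (see Section 1.3)
            self._on_page(FetchedPage(url, content))
```

Under this split, swapping the breadth-first application for a focused one changes only the code that decides which URLs to hand to request(), while the downloader side stays the same.
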
At first glance, implementation of the crawling system may appear trivial. This is however not true in the high-performance case, where several hundred or even a thousand pages have to be downloaded per second. In fact, our crawling system itself consists of several components that can be replicated for higher performance. Both crawling system and application can also be replicated independently, and several different applications could issue requests to the same crawling system, showing another motivation for the design. More details on the architecture are given in Section 2. We note here that this partition into application and system components is a design choice in our system, and not used by some other systems, and that it is not always obvious in which component a particular task should be handled. For the work in this paper, we focus on the case of a broad breadth-first crawler as our crawling application.

1.3  Requirements for a Crawler

We now discuss the requirements for a good crawler, and approaches for achieving them. Details on our solutions are given in the subsequent sections.

Flexibility: As mentioned, we would like to be able to use the system in a variety of scenarios, with as few modifications as possible.

Low Cost and High Performance: The system should scale to at least several hundred pages per second and hundreds of millions of pages per run, and should run on low-cost hardware. Note that efficient use of disk access is crucial to maintain a high speed after the main data structures, such as the “URL seen” structure and crawl frontier, become too large for main memory. This will only happen after downloading several million pages.
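
One common way to keep the “URL seen” check disk-friendly (a general technique, not necessarily the exact structure used in this system) is to store fixed-size URL fingerprints in a sorted file and test new URLs in sorted batches, so each batch costs one sequential scan instead of many random seeks. The sketch below assumes such a layout.

```python
import hashlib

def url_fingerprint(url: str) -> bytes:
    """Fixed-size fingerprint so the 'URL seen' file stores hashes, not full URLs."""
    return hashlib.sha1(url.encode("utf-8")).digest()[:8]

def merge_new_urls(seen_file_sorted, candidate_urls):
    """Batch membership test against a sorted on-disk list of fingerprints.

    seen_file_sorted: iterable of fingerprints in sorted order (e.g., read
    sequentially from disk).  Returns the candidates that are genuinely new.
    Batching turns many random lookups into a single sequential merge pass.
    """
    candidates = sorted((url_fingerprint(u), u) for u in set(candidate_urls))
    new_urls = []
    seen_iter = iter(seen_file_sorted)
    current = next(seen_iter, None)
    for fp, url in candidates:
        while current is not None and current < fp:
            current = next(seen_iter, None)
        if current != fp:
            new_urls.append(url)
    return new_urls
```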

Robustness: There are several aspects here. First, since the system will interact with millions of servers, it has to tolerate bad HTML, strange server behavior and configurations, and many other odd issues. Our goal here is to err on the side of caution, and if necessary ignore pages and even entire servers with odd behavior, since in many applications we can only download a subset of the pages anyway. Secondly, since a crawl may take weeks or months, the system needs to be able to tolerate crashes and network interruptions without losing (too much of) the data. Thus, the state of the system needs to be kept on disk. We note that we do not really require strict ACID properties. Instead, we decided to periodically synchronize the main structures to disk, and to recrawl a limited number of pages after a crash.
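
A minimal sketch of this kind of periodic, non-ACID checkpointing, assuming the main structures fit in a simple JSON dump (at real crawl sizes they do not): write to a temporary file and atomically rename it, so a crash never leaves a half-written checkpoint and only the pages crawled since the last checkpoint need to be redone.

```python
import json
import os
import time

class Checkpointer:
    """Periodic, non-ACID persistence of the crawler's main structures.
    Serialize to a temporary file, then atomically rename over the previous
    checkpoint so a crash never exposes a half-written state.  This is a
    sketch; the real system keeps much larger, custom on-disk structures."""

    def __init__(self, path, interval_s=600):
        self.path = path
        self.interval_s = interval_s
        self._last = 0.0

    def maybe_checkpoint(self, frontier, seen):
        now = time.time()
        if now - self._last < self.interval_s:
            return False
        tmp = self.path + ".tmp"
        with open(tmp, "w") as f:
            json.dump({"frontier": list(frontier), "seen": sorted(seen)}, f)
        os.replace(tmp, self.path)   # atomic rename: old checkpoint stays valid until now
        self._last = now
        return True
```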

Etiquette and Speed Control: It is extremely important to follow the standard conventions for robot exclusion (robots.txt and robots meta tags), to supply a contact URL for the crawler, and to supervise the crawl. In addition, we need to be able to control access speed in several different ways. We have to avoid putting too much load on a single server; we do this by contacting each site only once every 30 seconds unless specified otherwise. It is also desirable to throttle the speed on a domain level, in order not to overload small domains, and for other reasons to be explained later. Finally, since we are in a campus environment where our connection is shared with many other users, we also need to control the total download rate of our crawler. In particular, we crawl at low speed during the peak usage hours of the day, and at a much higher speed during the late night and early morning, limited mainly by the load tolerated by our main campus router.
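
The sketch below shows how robot exclusion and the per-host spacing could be enforced with Python's standard `urllib.robotparser`. The 30-second per-site interval and the day/night speed difference come from the text; the class structure, user-agent string, and concrete rate numbers are illustrative assumptions.

```python
import time
import urllib.robotparser
from urllib.parse import urlparse

class PolitenessPolicy:
    """Robot exclusion plus per-host request spacing, sketched with the
    standard library.  Only the 30-second interval is taken from the text."""

    def __init__(self, user_agent="ExampleCrawler", min_interval_s=30.0):
        self.user_agent = user_agent
        self.min_interval_s = min_interval_s
        self._robots = {}          # host -> RobotFileParser (or None if unreachable)
        self._last_contact = {}    # host -> timestamp of last request

    def allowed(self, url):
        host = urlparse(url).netloc
        if host not in self._robots:
            rp = urllib.robotparser.RobotFileParser()
            rp.set_url(f"http://{host}/robots.txt")
            try:
                rp.read()          # costs one extra request per host
            except Exception:
                rp = None          # unreachable robots.txt: skip the host for now
            self._robots[host] = rp
        rp = self._robots[host]
        return rp is not None and rp.can_fetch(self.user_agent, url)

    def wait_time(self, url):
        """Seconds to wait before this host may be contacted again."""
        host = urlparse(url).netloc
        elapsed = time.time() - self._last_contact.get(host, 0.0)
        return max(0.0, self.min_interval_s - elapsed)

    def record_contact(self, url):
        self._last_contact[urlparse(url).netloc] = time.time()

def global_rate_limit(now_hour, day_rate=50, night_rate=300):
    """Pages/second cap depending on time of day (campus peak vs. night);
    the concrete numbers are made up for illustration."""
    return night_rate if now_hour >= 22 or now_hour < 6 else day_rate
```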

Manageability and Reconfigurability: An appropriate interface is needed to monitor the crawl, including the speed of the crawler, statistics about hosts and pages, and the sizes of the main data sets. The administrator should be able to adjust the speed, add and remove components, shut down the system, force a checkpoint, or add hosts and domains to a “blacklist” of places that the crawler should avoid. After a crash or shutdown, the software of the system may be modified to fix problems, and we may want to continue the crawl using a different machine configuration. In fact, the software at the end of our first huge crawl was significantly different from that at the start, due to the need for numerous fixes and extensions that only became apparent after tens of millions of pages had been downloaded.
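
As a small illustration of one such control knob, the sketch below implements a host/domain blacklist that an administrator could extend while the crawl is running. The class and its methods are hypothetical, not the actual management interface.

```python
from urllib.parse import urlparse

class Blacklist:
    """Hosts and domain suffixes the crawler must avoid; entries can be added
    at runtime, e.g., from an administrator's console.  Sketch only."""

    def __init__(self):
        self._hosts = set()
        self._domain_suffixes = set()

    def add_host(self, host):
        self._hosts.add(host.lower())

    def add_domain(self, domain):
        self._domain_suffixes.add(domain.lower().lstrip("."))

    def blocked(self, url):
        host = urlparse(url).netloc.lower()
        if host in self._hosts:
            return True
        return any(host == d or host.endswith("." + d) for d in self._domain_suffixes)
```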

1.4  Content of this Paper

In this paper, we describe the design and implementation of a distributed web crawler that runs on a network of workstations. The crawler scales to (at least) several hundred pages per second, is resilient against system crashes and other events, and can be adapted to various crawling applications. We present the software architecture of the system, discuss the performance bottlenecks, and describe efficient techniques for achieving high performance. We also report experimental results based on a crawl of 120 million pages on 5 million hosts.

The remainder of this paper is organized as follows. Section 2 describes the architecture of our system and its major components, and Section 3 describes the data structures and algorithmic techniques that were used in more detail. Section 4 presents preliminary experimental results. Section 5 compares our design to that of other systems we know of. Finally, Section 6 offers some concluding remarks.

If there are any errors, corrections and comments are welcome!
