Design and Implementation of a High-Performance Distributed Web Crawler (Part 1)

alphaleng

Category: Graduation Project. Tags: web, search, system, search engine, performance, collections

Original text:

Design and Implementation of a High-Performance Distributed Web Crawler

Vladislav Shkapenyuk Torsten Suel

CIS Department

Polytechnic University

Brooklyn, NY 11201

vshkap@research.att.com, suel@poly.edu

Abstract
Broad web search engines as well as many more specialized search tools rely on web crawlers to acquire large collections of pages for indexing and analysis. Such a web crawler may interact with millions of hosts over a period of weeks or months, and thus issues of robustness, flexibility, and manageability are of major importance. In addition, I/O performance, network resources, and OS limits must be taken into account in order to achieve high performance at a reasonable cost.
In this paper, we describe the design and implementation of a distributed web crawler that runs on a network of workstations. The crawler scales to (at least) several hundred pages per second, is resilient against system crashes and other events, and can be adapted to various crawling applications. We present the software architecture of the system, discuss the performance bottlenecks, and describe efficient techniques for achieving high performance. We also report preliminary experimental results based on a crawl of 120 million pages on 5 million hosts.

1 Introduction

The World Wide Web has grown from a few thousand pages in 1993 to more than two billion pages at present. Due to this explosion in size, web search engines are becoming increasingly important as the primary means of locating relevant information. Such search engines rely on massive collections of web pages that are acquired with the help of web crawlers, which traverse the web by following hyperlinks and storing downloaded pages in a large database that is later indexed for efficient execution of user queries. Many researchers have looked at web search technology over the last few years, including crawling strategies, storage, indexing, ranking techniques, and a significant amount of work on the structural analysis of the web and web graph; see [1, 7] for overviews of some recent work and [26, 2] for basic techniques.

Thus, highly efficient crawling systems are needed in order to download the hundreds of millions of web pages indexed by the major search engines. In fact, search engines compete against each other primarily based on the size and currency of their underlying database, in addition to the quality and response time of their ranking function. Even the largest search engines, such as Google or AltaVista, currently cover only limited parts of the web, and much of their data is several months out of date. (We note, however, that crawling speed is not the only obstacle to increased search engine size, and that the scaling of query throughput and response time to larger collections is also a major issue.)

A crawler for a large search engine has to address two issues. First, it has to have a good crawling strategy, i.e., a strategy for deciding which pages to download next. Second, it needs to have a highly optimized system architecture that can download a large number of pages per second while being robust against crashes, manageable, and considerate of resources and web servers. There has been some recent academic interest in the first issue, including work on strategies for crawling important pages first [12, 21], crawling pages on a particular topic or of a particular type [9, 8, 23, 13], recrawling (refreshing) pages in order to optimize the overall “freshness” of a collection of pages [11, 10], or scheduling of crawling activity over time [25].

In contrast, there has been less work on the second issue. Clearly, all the major search engines have highly optimized crawling systems, although details of these systems are usually proprietary. The only system described in detail in the literature appears to be the Mercator system of Heydon and Najork at DEC/Compaq [16], which is used by AltaVista. (Some details are also known about the first version of the Google crawler [5] and the system used by the Internet Archive [6].) While it is fairly easy to build a slow crawler that downloads a few pages per second for a short period of time, building a high-performance system that can download hundreds of millions of pages over several weeks presents a number of challenges in system design, I/O and network efficiency, and robustness and manageability.
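To make the gap between a slow crawler and a high-performance one concrete, the following back-of-the-envelope sketch estimates how long a crawl would take at a given sustained rate. The figures are illustrative assumptions, not measurements from the paper: 100 million pages as a stand-in for "hundreds of millions", 5 pages per second for "a few", and 300 pages per second for "several hundred". The helper name `crawl_days` is hypothetical.

```python
SECONDS_PER_DAY = 86_400


def crawl_days(total_pages, pages_per_second):
    """Return the number of days needed to download total_pages
    at a constant rate of pages_per_second."""
    seconds = total_pages / pages_per_second
    return seconds / SECONDS_PER_DAY


# Assumed workload and rates, chosen only for illustration.
TOTAL_PAGES = 100_000_000
for rate in (5, 300):
    print(f"{rate} pages/s -> {crawl_days(TOTAL_PAGES, rate):.1f} days")
```

Under these assumptions the slow crawler needs roughly 231 days, while the faster one finishes in about 4 days, which is why sustained throughput matters as much as crawling strategy at this scale.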

Most of the recent work on crawling strategies does not address these performance issues at all, but instead attempts to minimize the number of pages that need to be downloaded, or maximize the benefit obtained per downloaded page. (An exception is the work in [8] that considers the system performance of a focused crawler built on top of a general-purpose database system, although the throughput of that system is still significantly below that of a high-performance bulk crawler.) In the case of applications that have only very limited bandwidth, that is acceptable. However, in the case of a larger search engine, we need to combine good crawling strategy and optimized system design.

In this paper, we describe the design and implementation of such an optimized system on a network of workstations. The choice of crawling strategy is largely orthogonal to our work. We describe the system using the example of a simple breadth-first crawl, although the system can be adapted to other strategies. We are primarily interested in the I/O and network efficiency aspects of such a system, and in scalability issues in terms of crawling speed and number of participating nodes. We are currently using the crawler to acquire large data sets for work on other aspects of web search technology such as indexing, query processing, and link analysis. We note that high-performance crawlers are currently not widely used by academic researchers, and hence few groups have run experiments on a scale similar to that of the major commercial search engines (one exception being the WebBase project [17] and related work at Stanford). There are many interesting questions in this realm of massive data sets that deserve more attention by academic researchers.
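The simple breadth-first crawl mentioned above can be sketched in a few lines. This is a minimal illustration and not the system described in the paper: the function `bfs_crawl`, the `fetch_links` callback, and the in-memory `TOY_WEB` graph are all hypothetical stand-ins for real page downloads and link extraction, and the sketch ignores distribution, politeness, and fault tolerance entirely.

```python
from collections import deque


def bfs_crawl(seeds, fetch_links, max_pages):
    """Visit pages in breadth-first order and return the download order.

    fetch_links(url) must return the list of URLs linked from url.
    """
    frontier = deque(seeds)  # FIFO queue of URLs waiting to be downloaded
    seen = set(seeds)        # every URL ever queued, so none is queued twice
    order = []
    while frontier and len(order) < max_pages:
        url = frontier.popleft()
        order.append(url)    # "download" the page
        for link in fetch_links(url):
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return order


# Toy link graph standing in for real HTTP fetches and HTML parsing.
TOY_WEB = {
    "a": ["b", "c"],
    "b": ["d"],
    "c": ["d", "a"],
    "d": [],
}

print(bfs_crawl(["a"], lambda url: TOY_WEB.get(url, []), max_pages=10))
```

Swapping the FIFO queue for a priority queue keyed on some page-importance score is one way such a loop could be adapted to other crawling strategies, which is the sense in which the choice of strategy is separate from the system design.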

Translation:

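Abstract
General-purpose web search engines, like many more specialized search tools, rely on web crawlers to acquire large collections of pages for indexing and analysis. Such a crawler may interact with millions of hosts over weeks or months, so robustness, flexibility, and manageability are of major importance. In addition, I/O performance, network resources, and operating system limits must be taken into account to achieve high performance at a reasonable cost.
In this paper, we describe the design and implementation of a distributed web crawler that runs on a network of workstations. The crawler scales to at least several hundred pages per second, is resilient against system crashes and other events, and can be adapted to various crawling applications. We present the software architecture of the system, discuss its performance bottlenecks, and describe efficient techniques for achieving high performance. We also report preliminary experimental results from a crawl of 120 million pages on 5 million hosts.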

1 Introduction

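The World Wide Web has grown from a few thousand pages in 1993 to more than two billion pages today. Because of this explosive growth, web search engines are becoming increasingly important as the primary means of locating relevant information. These search engines depend on massive collections of web pages acquired with the help of web crawlers, which traverse the web by following hyperlinks and store the downloaded pages in a large database that is later indexed so that user queries can be executed efficiently. Over the last few years, many researchers have studied web search technology, including crawling strategies, storage, indexing, and ranking techniques, along with a significant amount of work on the structural analysis of the web and the web graph; see [1, 7] for overviews of some recent work and [26, 2] for basic techniques.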

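Highly efficient crawling systems are therefore needed to download the hundreds of millions of web pages indexed by the major search engines. In fact, besides the quality and response time of their ranking functions, search engines compete primarily on the size and currency of their underlying databases. Even the largest search engines, such as Google or AltaVista, currently cover only limited parts of the web, and much of their data is several months out of date. (Note, however, that crawling speed is not the only obstacle to increasing search engine size; scaling query throughput and response time to larger collections is also a major issue.)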

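A crawler for a large search engine must address two issues. First, it needs a good crawling strategy, that is, a strategy for deciding which pages to download next. Second, it needs a highly optimized system architecture that can download a large number of pages per second while remaining robust against crashes, manageable, and considerate of resources and web servers. The first issue has attracted some recent academic interest, including work on strategies for crawling important pages first [12, 21], crawling pages on a particular topic or of a particular type [9, 8, 23, 13], recrawling (refreshing) pages to optimize the overall "freshness" of a collection [11, 10], and scheduling crawling activity over time [25].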

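In contrast, much less work has been done on the second issue. Clearly, all the major search engines have highly optimized crawling systems, but the details of these systems are usually proprietary. The only system described in detail in the literature appears to be the Mercator system developed by Heydon and Najork at DEC/Compaq [16], which is used by AltaVista. (Some details are also known about the first version of the Google crawler [5] and the system used by the Internet Archive [6].) Although it is fairly easy to build a slow crawler that downloads a few pages per second for a short period of time, building a high-performance system that can download hundreds of millions of pages over several weeks poses many challenges in system design, I/O and network efficiency, and robustness and manageability.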

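Most recent work on crawling strategies does not address these performance issues at all; instead, it tries to minimize the number of pages that must be downloaded or to maximize the benefit obtained per downloaded page. (The work in [8] is an exception: it considers the system performance of a focused crawler built on top of a general-purpose database system, although the throughput of that system is still significantly below that of a high-performance bulk crawler.) This is acceptable for applications that have only very limited bandwidth. For a larger search engine, however, a good crawling strategy must be combined with an optimized system design.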

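In this paper, we describe the design and implementation of such an optimized system on a network of workstations. The choice of crawling strategy is largely orthogonal to our work. We describe the system using a simple breadth-first crawl as the example, although the system can be adapted to other strategies. We are mainly interested in the I/O and network efficiency of such a system, and in its scalability in terms of crawling speed and the number of participating nodes. We are currently using the crawler to acquire large data sets for work on other aspects of web search technology, such as indexing, query processing, and link analysis. Note that high-performance crawlers are not yet widely used by academic researchers, so few groups have run experiments at a scale similar to that of the major commercial search engines (one exception being the WebBase project [17] and related work at Stanford). The realm of massive data sets holds many interesting questions that deserve more attention from academic researchers.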

If you find any errors, corrections are welcome!