Design and Implementation of a High-Performance Distributed Web Crawler
Vladislav Shkapenyuk Torsten Suel
CIS Department, Polytechnic University, Brooklyn, NY 11201
vshkap@research.att.com, suel@poly.edu
Abstract
Broad web search engines as well as many more specialized search tools rely on web crawlers to acquire large collections of pages for indexing and analysis. Such a web crawler may interact with millions of hosts over a period of weeks or months, and thus issues of robustness, flexibility, and manageability are of major importance. In addition, I/O performance, network resources, and OS limits must be taken into account in order to achieve high performance at a reasonable cost. In this paper, we describe the design and implementation of a distributed web crawler that runs on a network of workstations. The crawler scales to (at least) several hundred pages per second, is resilient against system crashes and other events, and can be adapted to various crawling applications. We present the software architecture of the system, discuss the performance bottlenecks, and describe efficient techniques for achieving high performance. We also report preliminary experimental results based on a crawl of 120 million pages on 5 million hosts.
1 Introduction
The World Wide Web has grown from a few thousand pages in 1993 to more than two billion pages at present. Due to this explosion in size, web search engines are becoming increasingly important as the primary means of locating relevant information. Such search engines rely on massive collections of web pages that are acquired with the help of web crawlers, which traverse the web by following hyperlinks and storing downloaded pages in a large database that is later indexed for efficient execution of user queries. Many researchers have looked at web search technology over the last few years, including crawling strategies, storage, indexing, ranking techniques, and a significant amount of work on the structural analysis of the web and web graph; see [1, 7] for overviews of some recent work and [26, 2] for basic techniques.
Thus, highly efficient crawling systems are needed in order to download the hundreds of millions of web pages indexed by the major search engines. In fact, search engines compete against each other primarily based on the size and currency of their underlying database, in addition to the quality and response time of their ranking function. Even the largest search engines, such as Google or AltaVista, currently cover only limited parts of the web, and much of their data is several months out of date. (We note, however, that crawling speed is not the only obstacle to increased search engine size, and that the scaling of query throughput and response time to larger collections is also a major issue.)
A crawler for a large search engine has to address two issues. First, it has to have a good crawling strategy, i.e., a strategy for deciding which pages to download next. Second, it needs to have a highly optimized system architecture that can download a large number of pages per second while being robust against crashes, manageable, and considerate of resources and web servers. There has been some recent academic interest in the first issue, including work on strategies for crawling important pages first [12, 21], crawling pages on a particular topic or of a particular type [9, 8, 23, 13], recrawling (refreshing) pages in order to optimize the overall “freshness” of a collection of pages [11, 10], or scheduling of crawling activity over time [25].
In contrast, there has been less work on the second issue. Clearly, all the major search engines have highly optimized crawling systems, although details of these systems are usually proprietary. The only system described in detail in the literature appears to be the Mercator system of Heydon and Najork at DEC/Compaq [16], which is used by AltaVista. (Some details are also known about the first version of the Google crawler [5] and the system used by the Internet Archive [6].) While it is fairly easy to build a slow crawler that downloads a few pages per second for a short period of time, building a high-performance system that can download hundreds of millions of pages over several weeks presents a number of challenges in system design, I/O and network efficiency, and robustness and manageability.
Most of the recent work on crawling strategies does not address these performance issues at all, but instead attempts to minimize the number of pages that need to be downloaded, or maximize the benefit obtained per downloaded page. (An exception is the work in [8] that considers the system performance of a focused crawler built on top of a general-purpose database system, although the throughput of that system is still significantly below that of a high-performance bulk crawler.) For applications that have only very limited bandwidth, this is acceptable. However, in the case of a larger search engine, we need to combine a good crawling strategy with an optimized system design.
In this paper, we describe the design and implementation of such an optimized system on a network of workstations. The choice of crawling strategy is largely orthogonal to our work. We describe the system using the example of a simple breadth-first crawl, although the system can be adapted to other strategies. We are primarily interested in the I/O and network efficiency aspects of such a system, and in scalability issues in terms of crawling speed and number of participating nodes. We are currently using the crawler to acquire large data sets for work on other aspects of web search technology such as indexing, query processing, and link analysis. We note that high-performance crawlers are currently not widely used by academic researchers, and hence few groups have run experiments on a scale similar to that of the major commercial search engines (one exception being the WebBase project [17] and related work at Stanford). There are many interesting questions in this realm of massive data sets that deserve more attention by academic researchers.
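To make the notion of a simple breadth-first crawl concrete, the following is a minimal single-process sketch in Python. It is an illustration only and not the implementation of the system described in this paper: the function name `breadth_first_crawl`, the `fetch_links` callback, the `max_pages` limit, and the in-memory `TOY_WEB` link graph are all hypothetical, and the sketch ignores the I/O, network, and robustness issues that are our main concern.

```python
from collections import deque

def breadth_first_crawl(seeds, fetch_links, max_pages):
    """Return URLs in the order a breadth-first crawl would download them.

    fetch_links(url) is a hypothetical callback that downloads a page
    and returns the list of hyperlinks found on it.
    """
    frontier = deque()   # FIFO queue of URLs waiting to be downloaded
    seen = set()         # URLs already queued, so each is fetched at most once
    for url in seeds:
        if url not in seen:
            seen.add(url)
            frontier.append(url)

    order = []
    while frontier and len(order) < max_pages:
        url = frontier.popleft()          # oldest URL first gives breadth-first order
        order.append(url)
        for link in fetch_links(url):
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return order

# Toy link graph standing in for the web (an assumption for this example).
TOY_WEB = {
    "a": ["b", "c"],
    "b": ["d"],
    "c": ["d", "a"],
    "d": [],
}

if __name__ == "__main__":
    print(breadth_first_crawl(["a"], lambda u: TOY_WEB.get(u, []), max_pages=10))
```

Replacing the FIFO queue with a priority queue ordered by some importance score would yield one of the other crawling strategies cited above, which is one sense in which the choice of strategy is largely independent of the rest of the system.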