Translation of "An Introduction to Heritrix"

My school requires the graduation project to include a translation of 8,000 characters of English reference material (I don't know why the school counts characters rather than words or Chinese characters; wouldn't a source full of long code listings be an easy win?), and that is how this post came about... This material does not seem to have a Chinese version yet; while translating it I consulted a number of online resources, in particular http://codex.wordpress.org.cn/Heritrix.

-------------------------------------------------- The reference material follows --------------------------------------------------

An Introduction to Heritrix

An open source archival quality web crawler

Gordon Mohr, Michael Stack, Igor Ranitovic, Dan Avery and Michele Kimpton

Internet Archive Web Team

{gordon,stack,igor,dan,michele}@archive.org

Abstract.

Heritrix is the Internet Archive's open-source, extensible, web-scale, archival-quality web crawler project. The Internet Archive started Heritrix development in the early part of 2003. The intention was to develop a crawler for the specific purpose of archiving websites and to support multiple different use cases including focused and broad crawling. The software is open source to encourage collaboration and joint development across institutions with similar needs. A pluggable, extensible architecture facilitates customization and outside contribution. Now, after over a year of development, the Internet Archive and other institutions are using Heritrix to perform focused and increasingly broad crawls.



Introduction

The Internet Archive (IA) is a 5013C non-profit corporation, whose mission is to build a public Internet digital library. Over the last 6 years, IA has built the largest public web archive to date, hosting over 400 TB of data.


The Web Archive is comprised primarily of pages collected by Alexa Internet starting in 1996. Alexa Internet is a Web cataloguing company founded by Brewster Kahle and Bruce Gilliat in 1996. Alexa Internet takes a snapshot of the web every 2 months, currently collecting 10 TB of data per month from over 35 million sites. Alexa Internet donates this crawl to the Internet Archive, and IA stores and indexes the collection. Alexa uses its own proprietary software and techniques to crawl the web. This software is not available to Internet Archive or other institutions for use or extension.


In the latter part of 2002, the Internet Archive wanted the ability to do crawling internally for its own purposes and to be able to partner with other institutions to crawl and archive the web in new ways. The Internet Archive concluded it needed a large-scale, thorough, easily customizable crawler. After doing an evaluation of other open source software available at the time, it was concluded that no appropriate software existed that had the flexibility required yet could scale to perform broad crawls.


The Internet Archive believed it was essential the software be open source to promote collaboration between institutions interested in archiving the web. Developing open source software would encourage participating institutions to share crawling experiences, solutions to common problems, and even the development of new features.


The Internet Archive began work on this new open source crawler development project in the first half of 2003. It named the crawler Heritrix. This paper gives a high-level overview of the Heritrix crawler, circa version 1.0.0 (August 2004). It outlines the original use-cases, the general architecture, current capabilities and current limitations. It also describes how the crawler is currently used at the Internet Archive and future plans for development both internally and by partner institutions.


Use Cases

The Internet Archive and its collaborators wanted a crawler capable of each of the following crawl types:


Broad crawling: Broad crawls are large, high-bandwidth crawls in which the number of sites and the number of valuable individual pages collected is as important as the completeness with which any one site is covered. At the extreme, a broad crawl tries to sample as much of the web as possible given the time, bandwidth, and storage resources available.


Focused crawling: Focused crawls are small- to medium-sized crawls (usually less than 10 million unique documents) in which the quality criterion is complete coverage of some selected sites or topics.


Continuous crawling: Traditionally, crawlers pursue a snapshot of resources of interest, downloading each unique URI one time only. Continuous crawling, by contrast, revisits previously fetched pages looking for changes as well as discovering and fetching new pages, even adapting its rate of visitation based on operator parameters and estimated change frequencies.

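As one illustration of what "adapting its rate of visitation" can mean in practice, the sketch below shows a generic adaptive revisit policy: visit more often while a page keeps changing, back off while it stays the same, within operator-set bounds. This is a made-up example of the general idea, not the scheme Heritrix itself uses.

```java
// A generic adaptive revisit policy, shown only to illustrate the idea of
// adjusting visitation rate; it is not Heritrix's actual algorithm.
public class RevisitPolicy {

    private final long minWaitMs;   // operator-set lower bound on the revisit interval
    private final long maxWaitMs;   // operator-set upper bound on the revisit interval

    public RevisitPolicy(long minWaitMs, long maxWaitMs) {
        this.minWaitMs = minWaitMs;
        this.maxWaitMs = maxWaitMs;
    }

    /**
     * @param currentWaitMs   the interval used before the last visit
     * @param contentChanged  whether the fetched content differed from last time
     * @return the interval to wait before the next visit
     */
    public long nextWait(long currentWaitMs, boolean contentChanged) {
        // Visit more often when the page is changing, back off when it is stable.
        long next = contentChanged ? currentWaitMs / 2 : currentWaitMs * 2;
        return Math.max(minWaitMs, Math.min(maxWaitMs, next));
    }
}
```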

Experimental crawling: The Internet Archive and other groups want to experiment with crawling techniques in areas such as choice of what to crawl, order in which resources are crawled, crawling using diverse protocols, and analysis and archiving of crawl results.


Required Capabilities

Based on the above use cases, the Internet Archive compiled a list of the capabilities required of a crawling program. An important contribution in developing the archival crawling requirements came from the efforts of the International Internet Preservation Consortium (IIPC), a consortium of twelve National Libraries and the Internet Archive. The mission of the IIPC is to acquire, preserve and make accessible knowledge and information from the Internet for future generations everywhere, promoting global exchange and international relations. IIPC member libraries have diverse web resource collection needs, and contributed several detailed requirements that helped define the goals for a common crawler. The detailed requirements document developed by the IIPC can be found at http://netpreserve.org/publications/iipc-d-001.pdf.


The Heritrix Project

Heritrix is an archaic word for heiress, a woman who inherits. Since our crawler seeks to collect and preserve the digital artifacts of our culture for the benefit of future researchers and generations, this name seemed appropriate.


Java was chosen as the implementation software language. As a high-level, object-oriented language, Java offers strong support for modular design and components that are both incrementally extendable and individually replaceable. Other key reasons for choosing Java were its rich set of quality open source libraries and its large developer community, making it more likely that we would benefit from previous work and outside contributions.


The project homepage is <http://crawler.archive.org>, featuring the most current information on the project, downloads of the crawling software, and project documents. There are also public databases of outstanding feature requests and bugs. The project also provides an open mailing list to promote exchange of information between Heritrix developers and other interested users.


Sourceforge [SOURCEFORGE], a site offering free online services for over 84,000 open source software efforts, hosts the Heritrix project. Sourceforge provides many of the online collaborative tools required to manage distributed-team software projects such as:

Versioned source code repository (CVS)

Issue databases for tracking bugs and enhancement requests

Web hosting

Mirrored, archived, high-availability file release system

Mailing lists


A large community of developers knows and uses Sourceforge, which can help to create awareness of and interest in the software. The limitations of Sourceforge include:

Occasionally unavailable or slow

No direct control over problems that do arise

Issue-tracking features and reporting are crude

Cannot be used to manage other internal software projects which may not be open source


Heritrix is licensed under the Gnu Lesser General Public License (LGPL) [LGPL]. The LGPL is similar to the Gnu General Public License (GPL) in that it offers free access to the full source code of a program for reuse, extension, and the creation of derivative works, under the condition that changes to the code are also made freely available. However, it differs from the full GPL in allowing use as a module or library inside other proprietary, hidden source applications, as long as changes to the library are shared.


Milestones

Since project inception, major project milestones have included:

-Investigational crawler prototype created, and various threading and network access strategies tested, 2nd Q 2003

-Core crawler without user-interface created, to verify architecture and test coverage compared to HTTrack [HTTRACK] and Mercator [MERCATOR] crawlers, 3rd Q 2003

-Nordic Web Archive [NWA] programmers join project in San Francisco, 4th Q 2003 to 1st Q 2004, adding a web user-interface, a rich configuration framework, documentation, and other key improvements

-First public release (version 0.2.0) on Sourceforge, January 2004

-First contributions by unaffiliated developer, January 2004

-Workshops with National Library users in February and June of 2004

-Used as the IA’s crawler for all contracted crawls – usually consisting of a few dozen to a few hundred sites of news, cultural, or public-policy interest – since the beginning of 2004

-Adopted as the official crawler for the NWA, June 2004

-Version 1.0.0 official release, for focused and experimental crawling, in August 2004


Heritrix Crawler Architecture

In this section we give an overview of the Heritrix architecture, describing the general operation and key components.

At its core, the Heritrix crawler was designed as a generic crawling framework into which various interchangeable components can be plugged. Varying these components enables diverse collection and archival strategies, and supports the incremental evolution of the crawler from limited features and small crawls to our ultimate goal of giant full-featured crawls.


Crawl setup involves choosing and configuring a set of specific components to run. Executing a crawl repeats the following recursive process, common to all web crawlers, with the specific components chosen:

1. Choose a URI from among all those scheduled

2. Fetch that URI

3. Analyze or archive the results

4. Select discovered URIs of interest, and add to those scheduled

5. Note that the URI is done and repeat

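The sketch below renders this cycle in Java. It is only a schematic of the five steps using made-up helper types (Fetcher, Analyzer, Scope are illustrative names here, not Heritrix's actual API):

```java
import java.util.ArrayDeque;
import java.util.HashSet;
import java.util.List;
import java.util.Queue;
import java.util.Set;

// A minimal, hypothetical sketch of the generic crawl cycle described above.
// None of these types are Heritrix classes; they only mirror the five steps.
public class CrawlLoopSketch {

    interface Fetcher  { byte[] fetch(String uri) throws Exception; }          // step 2
    interface Analyzer { List<String> extractLinks(String uri, byte[] body); } // step 3
    interface Scope    { boolean accepts(String uri); }                        // step 4 filter

    private final Queue<String> scheduled = new ArrayDeque<>(); // URIs waiting to be tried
    private final Set<String> seen = new HashSet<>();           // guards against rescheduling

    public void crawl(List<String> seeds, Fetcher fetcher, Analyzer analyzer, Scope scope) {
        seeds.forEach(this::schedule);
        while (!scheduled.isEmpty()) {
            String uri = scheduled.poll();                  // 1. choose a scheduled URI
            try {
                byte[] body = fetcher.fetch(uri);           // 2. fetch that URI
                for (String found : analyzer.extractLinks(uri, body)) { // 3. analyze/archive
                    if (scope.accepts(found)) {             // 4. keep in-scope discoveries
                        schedule(found);
                    }
                }
            } catch (Exception e) {
                // a real crawler would log, retry, or defer the URI here
            }
            // 5. the URI is done; repeat
        }
    }

    private void schedule(String uri) {
        if (seen.add(uri)) {
            scheduled.add(uri);
        }
    }
}
```

In the real crawler, the scheduling structure, the fetcher, and the analysis and archiving steps are all interchangeable components, which is what lets the same loop serve focused, broad, continuous, and experimental crawls.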

The three most prominent components of Heritrix are the Scope, the Frontier, and the Processor Chains, which together serve to define a crawl.

The Scope determines what URIs are ruled into or out of a certain crawl. The Scope includes the "seed" URIs used to start a crawl, plus the rules used in step 4 above to determine which discovered URIs are also to be scheduled for download.


The Frontier tracks which URIs are scheduled to be collected, and those that have already been collected. It is responsible for selecting the next URI to be tried (in step 1 above), and prevents the redundant rescheduling of already-scheduled URIs (in step 4 above).


The Processor Chains include modular Processors that perform specific, ordered actions on each URI in turn. These include fetching the URI (as in step 2 above), analyzing the returned results (as in step 3 above), and passing discovered URIs back to the Frontier (as in step 4 above).

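To illustrate the division of labour, the three roles might be captured by minimal Java interfaces like the following. The component names (Scope, Frontier, Processor) come from the text above, but every signature here is an assumption for illustration rather than the real Heritrix API; CrawlItem is a stand-in for the per-URI state object (the CrawlURI of Figure 1).

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;

// Hypothetical interfaces sketching the roles of the three core components;
// the real Heritrix classes carry far more state and configuration.
interface Scope {
    Collection<String> seeds();   // the "seed" URIs used to start a crawl
    boolean accepts(String uri);  // rules a discovered URI in or out (step 4)
}

interface Frontier {
    String next();                // select the next URI to be tried (step 1)
    void schedule(String uri);    // enqueue a URI unless already scheduled (step 4)
    void finished(String uri);    // note that a URI is done (step 5)
    boolean isEmpty();
}

// Each Processor performs one specific, ordered action on a URI in turn:
// fetching it (step 2), analyzing or archiving the result (step 3), or
// passing discovered URIs back to the Frontier (step 4).
interface Processor {
    void process(CrawlItem item) throws Exception;
}

// Stand-in for the per-URI state object passed along the processor chains.
class CrawlItem {
    final String uri;
    byte[] content;                                          // set by a fetch processor
    final List<String> discoveredLinks = new ArrayList<>();  // filled by a link extractor

    CrawlItem(String uri) {
        this.uri = uri;
    }
}
```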

Figure 1 shows these major components of the crawler, as well as other supporting components, with major relationships highlighted.



Figure 1: The Web Administrative Console composes a CrawlOrder, which is then used to create a working assemblage of components within the CrawlController. Within the CrawlController, arrows indicate the progress of a single scheduled CrawlURI within one ToeThread.

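To make the flow in the caption concrete, here is one possible, heavily simplified reading of it in Java: a crawl order, reduced here to nothing more than a seed list and a thread count, is used to assemble a controller that runs a pool of worker "toe" threads, each pulling one scheduled URI at a time and passing it through a processor chain. The class and method names are assumptions for illustration; only CrawlOrder, CrawlController, CrawlURI and ToeThread correspond to names used in the caption, and the real components are far richer.

```java
import java.util.List;
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;

// A deliberately simplified, hypothetical rendering of Figure 1:
// an order of settings -> a controller -> a pool of "toe" worker threads.
public class ControllerSketch {

    // Stand-in for a CrawlOrder: only the settings this sketch needs.
    static class Order {
        final List<String> seeds;
        final int toeThreads;
        Order(List<String> seeds, int toeThreads) {
            this.seeds = seeds;
            this.toeThreads = toeThreads;
        }
    }

    // Stand-in for a whole processor chain applied to one scheduled URI;
    // a real chain would fetch, analyze, archive, and feed links back.
    interface ProcessorChain {
        void run(String uri, Queue<String> frontier);
    }

    public static void start(Order order, ProcessorChain chain) throws InterruptedException {
        Queue<String> frontier = new ConcurrentLinkedQueue<>(order.seeds);

        Runnable toeWork = () -> {
            String uri;
            // Each "toe" thread handles one URI at a time until the frontier
            // runs dry. (A real controller keeps idle threads around instead
            // of letting them exit when the queue is momentarily empty.)
            while ((uri = frontier.poll()) != null) {
                chain.run(uri, frontier);
            }
        };

        Thread[] toes = new Thread[order.toeThreads];
        for (int i = 0; i < toes.length; i++) {
            toes[i] = new Thread(toeWork, "toe-" + i);
            toes[i].start();
        }
        for (Thread toe : toes) {
            toe.join();
        }
    }
}
```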

