Translation of "An Introduction to Heritrix" (continued)

 

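Today's high in Shanghai was above 35°C, and translating in this heat is no easy task. The weather already had me restless, and the patient deliberation that translation requires only made it worse. I managed to get through a few short paragraphs, and I am still not fully satisfied with them...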

----------------------------------------------------------------------------------------------------------------------------

Key Components:

In this section we describe each of the components shown in Figure 1 in more detail.

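In many ways the Web Administrative Console is a standalone web application, hosted by the embedded Jetty Java HTTP server. Through its web pages the operator chooses a crawl's components and parameters, composing a CrawlOrder: a configuration object that also has an external XML representation.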

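A crawl is started by passing this CrawlOrder to the CrawlController, the component that instantiates all configured crawl components and holds references to them. The CrawlController is the crawl's global context: all subcomponents reach each other through it, and the Web Administrative Console also controls the crawl through it.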

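The CrawlOrder contains enough information to create the Scope. The Scope seeds the Frontier with the initial URIs, and it is consulted to decide which later-discovered URIs should also be queued.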
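To make the Scope's two roles concrete, here is a minimal sketch in Python rather than Heritrix's own Java. The class `DomainScope`, the helper functions, and the "same host as a seed" rule are all assumptions made for illustration; Heritrix's real scope classes are configurable and considerably richer.

```python
from urllib.parse import urlsplit


class DomainScope:
    """Hypothetical scope: a URI is in scope if its host matches a seed's host."""

    def __init__(self, seeds):
        self.seeds = list(seeds)
        # Hosts of the seed URIs define the crawl boundary in this sketch.
        self.hosts = {urlsplit(seed).hostname for seed in self.seeds}

    def accepts(self, uri):
        return urlsplit(uri).hostname in self.hosts


def seed_frontier(scope, frontier):
    """Role 1: give the frontier (a plain list here) its initial URIs."""
    for uri in scope.seeds:
        frontier.append(uri)


def schedule_if_in_scope(scope, frontier, discovered):
    """Role 2: queue only those later-discovered URIs the scope accepts."""
    added = []
    for uri in discovered:
        if scope.accepts(uri):
            frontier.append(uri)
            added.append(uri)
    return added
```

With a single seed such as `http://example.org/`, a link to `http://example.org/a` would be queued, while a link to a different host would be dropped.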

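The Frontier is responsible for ordering the URIs to be visited, ensuring that URIs are not revisited unnecessarily, and moderating the crawler's visits to any one remote site. It does this by maintaining a series of internal queues of URIs to be visited, plus a list of all URIs already visited or queued.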
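The bookkeeping described here (several internal queues plus an already-seen list) can be sketched as follows. This is an illustration only: `SimpleFrontier` is a hypothetical name, and keying the queues by host is an assumption, since the text does not say how the queues are divided.

```python
from collections import OrderedDict, deque
from urllib.parse import urlsplit


class SimpleFrontier:
    """Hypothetical frontier: one FIFO queue per host plus a seen set."""

    def __init__(self):
        self.queues = OrderedDict()  # host -> deque of pending URIs
        self.seen = set()            # every URI already visited or queued

    def schedule(self, uri):
        """Queue a URI unless it was seen before; return True if queued."""
        if uri in self.seen:
            return False
        self.seen.add(uri)
        host = urlsplit(uri).hostname
        self.queues.setdefault(host, deque()).append(uri)
        return True

    def pending(self):
        """Total number of URIs still waiting in all queues."""
        return sum(len(queue) for queue in self.queues.values())
```

Scheduling the same URI twice leaves only one copy queued, which is how unnecessary revisits are avoided in this sketch.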

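URIs are released from the queues for fetching only in a manner compatible with the configured politeness policy. The default Frontier implementation offers a primarily breadth-first, order-of-discovery policy for choosing URIs to process, with an option to prefer finishing sites already in progress over beginning new sites. Other Frontier implementations are possible.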
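A rough sketch of how a politeness policy can gate release from the queue while otherwise keeping order of discovery. The fixed per-host delay and the name `PoliteQueue` are assumptions for illustration; the text does not specify how Heritrix computes its delays. The caller passes the current time in, which keeps the example deterministic.

```python
from urllib.parse import urlsplit


class PoliteQueue:
    """Hypothetical queue: order of discovery, with a fixed per-host delay."""

    def __init__(self, delay_seconds):
        self.delay = delay_seconds
        self.pending = []    # URIs in order of discovery
        self.ready_at = {}   # host -> earliest time the next fetch is allowed

    def add(self, uri):
        self.pending.append(uri)

    def next_uri(self, now):
        """Release the earliest-discovered URI whose host may be visited now."""
        for index, uri in enumerate(self.pending):
            host = urlsplit(uri).hostname
            if self.ready_at.get(host, 0.0) <= now:
                # Block this host until the delay has elapsed.
                self.ready_at[host] = now + self.delay
                return self.pending.pop(index)
        return None  # every pending host is still in its waiting period
```

With a 5-second delay and two URIs from one host followed by one from another, the second host's URI is released before the first host's second URI, because the first host is still waiting.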

-----------------------------------------------------------English original--------------------------------------------------------

Key Components

      In this section we go into more detail on each of the components featured in Figure 1.

 

     The Web Administrative Console is in many ways a standalone web application, hosted by the embedded Jetty Java HTTP server. Its web pages allow the operator to choose a crawl's components and parameters by composing a CrawlOrder, a configuration object that also has an external XML representation.

      A crawl is initiated by passing this CrawlOrder to the CrawlController, a component which instantiates and holds references to all configured crawl components. The CrawlController is the crawl's global context: all subcomponents can reach each other through it. The Web Administrative Console controls the crawl through the CrawlController.

      The CrawlOrder contains sufficient information to create the Scope. The Scope seeds the Frontier with initial URIs and is consulted to decide which later-discovered URIs should also be scheduled. 

      The Frontier has responsibility for ordering the URIs to be visited, ensuring URIs are not revisited unnecessarily, and moderating the crawler's visits to any one remote site. It achieves these goals by maintaining a series of internal queues of URIs to be visited, and a list of all URIs already visited or queued. URIs are only released from queues for fetching in a manner compatible with the configured politeness policy. The default provided Frontier implementation offers a primarily breadth-first, order-of-discovery policy for choosing URIs to process, with an option to prefer finishing sites in progress to beginning new sites. Other Frontier implementations are possible.

 

 

 

 
