Nutch 2.0 Crawl Flow -- Nutch2Crawling

Translated from http://wiki.apache.org/nutch/Nutch2Crawling. This is a first translation, so there may be inaccuracies; corrections are welcome.

This document describes the crawling jobs of Nutch2 (NutchGora).


Introduction

A crawl cycle consists of 4 steps, each implemented as a Hadoop job.


  • GeneratorJob

  • FetcherJob

  • ParserJob (optionally done during fetch using 'fetcher.parse')

  • DbUpdaterJob

To populate initial rows for the webtable you can use the InjectorJob.


There is a single table webpage that is the input and output for these jobs. Every row in this table is a url (WebPage). To group urls from the same TLD and domain closely together, the row key is stored as the url with reversed host components. This takes advantage of the fact that row keys are sorted (in most NoSQL stores). Scanning over a subset is generally a lot faster than scanning over the entire table with specific rowkey filtering. See the following example rowkey listing:


  • com.example.www:http/
  • com.example.www:http/about/
  • net.sourceforge:http/
  • ...
  • org.apache.gora:http/
  • org.apache.nutch:http/
  • org.apache.www:http/
  • org.wikipedia.en:http/
  • org.wikipedia.nl:http/
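
As a rough illustration of how such a reversed-host row key can be built (a minimal sketch in plain Java; the class and method names are hypothetical, loosely modeled on Nutch's TableUtil.reverseUrl rather than copied from it):

    import java.net.MalformedURLException;
    import java.net.URL;

    // Hypothetical sketch: build a row key with reversed host components,
    // e.g. http://www.example.com/about/ -> com.example.www:http/about/
    public class RowKeySketch {

        static String reverseHost(String host) {
            String[] parts = host.split("\\.");
            StringBuilder sb = new StringBuilder();
            for (int i = parts.length - 1; i >= 0; i--) {
                sb.append(parts[i]);
                if (i > 0) sb.append('.');
            }
            return sb.toString();
        }

        static String toRowKey(String url) throws MalformedURLException {
            URL u = new URL(url);
            return reverseHost(u.getHost()) + ":" + u.getProtocol() + u.getPath();
        }

        public static void main(String[] args) throws Exception {
            System.out.println(toRowKey("http://www.example.com/about/"));  // com.example.www:http/about/
        }
    }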


    Generate

    Creates a new batch. Selects urls to fetch from the webtable (or: marks urls in the webtable which need to be fetched).

  • Mapper

    Reads every url from the webtable.

    1. Skip the url if it has been generated before (has a generated mark (batch ID)). This will allow you to run multiple generates.
    2. (optional) Normalize the url (apply URLNormalizers) and filter the url (apply URLFilters). Disable for fast generating of urls.
    3. Is it time to fetch this url? (fetchTime < now)

    4. Calculate scoring for this url.
    5. Outputs every url that needs to be fetched, together with its score (SelectorEntry).

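    These per-row checks can be sketched as follows (plain Java with hypothetical field names; the real mapper operates on Gora WebPage rows and emits SelectorEntry objects):

        // Hypothetical sketch of the GeneratorJob mapper's per-row checks.
        public class GenerateMapperSketch {

            static class Row {                    // stand-in for a webtable row
                String url;
                String generateMark;              // batch ID, or null if never generated
                long fetchTime;
                float score;
            }

            // Returns the score to emit (SelectorEntry) or null to skip the row.
            static Float select(Row row, long now) {
                if (row.generateMark != null) return null;  // 1. already generated in a batch
                // 2. (optional) URLNormalizers and URLFilters would run here
                if (row.fetchTime >= now) return null;      // 3. not yet due for fetching
                return row.score;                           // 4./5. emit url with its score
            }

            public static void main(String[] args) {
                Row row = new Row();
                row.url = "http://www.example.com/";
                row.fetchTime = 0L;                         // due immediately
                row.score = 1.0f;
                System.out.println(row.url + " -> " + select(row, System.currentTimeMillis()));
            }
        }
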
  • Partitioning

    All urls are partitioned by domain, host or IP (the partition.url.mode property). This means that all urls from the same domain (host, IP) end up in the same partition and will be handled by the same reduce task. Within each partition all urls are sorted by score (best first). Notes:

    • (Optional) Normalize urls during partitioning. Disable for fast partitioning of urls.
    • When partitioning by IP: this might be heavy on DNS resolving!

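    A minimal sketch of what host-mode partitioning amounts to (hypothetical code; the real job uses Nutch's URLPartitioner driven by the partition.url.mode property):

        import java.net.URL;

        // Hypothetical sketch: all urls of one host hash to the same partition,
        // so one reduce task sees every url of that host.
        public class PartitionSketch {

            static int partitionByHost(String url, int numReduceTasks) throws Exception {
                String host = new URL(url).getHost();
                return (host.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
            }

            public static void main(String[] args) throws Exception {
                System.out.println(partitionByHost("http://www.example.com/a", 5));
                System.out.println(partitionByHost("http://www.example.com/b", 5)); // same partition
            }
        }
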
  • Reducer

    Reads every url from the partition and keeps selecting urls until the total number of urls has reached a limit, or the number of urls per domain (host, IP) has reached a limit.

    1. Stop including urls when we have reached topN/reducers urls. This will give topN urls for all reducers together.
    2. Stop including urls for a certain domain (host, IP) if the number of urls for that domain (host, IP) exceeds generate.max.count. The reducer keeps track of this using a map. This works because all urls from the same domain (host, IP) are handled within the same reduce task. Note that if the number of different domains (hosts, IPs) per reducer is large, the map may become large. However, one can always increase the number of reducers.
    3. For each selected URL, write a generate mark (batch ID).
    4. Output the row to the webtable.

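    The two limits can be sketched like this (toy Java, assuming the incoming urls are already sorted best-first; the per-host map mirrors the bookkeeping described in step 2):

        import java.util.HashMap;
        import java.util.Map;

        // Hypothetical sketch of the GeneratorJob reducer's selection limits.
        public class GenerateReducerSketch {

            public static void main(String[] args) {
                long limit = 2500 / 5;      // topN / number of reducers
                int maxCount = 100;         // generate.max.count
                Map<String, Integer> hostCount = new HashMap<>();
                long selected = 0;

                String[] hosts = { "abc.com", "xyz.com", "xyz.com" };  // toy input, best score first
                for (String host : hosts) {
                    if (selected >= limit) break;              // 1. per-reducer topN reached
                    int count = hostCount.getOrDefault(host, 0);
                    if (count >= maxCount) continue;           // 2. per-host limit reached
                    hostCount.put(host, count + 1);
                    selected++;                                // 3./4. write generate mark, output row
                }
                System.out.println(hostCount);                 // {abc.com=1, xyz.com=2}
            }
        }
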
  • Result

    A maximum of topN urls gets selected. That is: these get a generator mark in the webtable. However, there are two reasons why the topN best-scoring urls are not always selected, or why not even topN urls are selected at all (while there are enough urls to fetch):

    1. The property generate.max.count limits the number of urls per domain (host, IP). So urls from domain abc.com may be skipped (we already have enough urls from abc.com) while urls with a lower score from xyz.com can still get in (there is still room for more urls from xyz.com). This is good: we'd rather have lower-scored urls than too many from the same domain (host, IP).
    2. Each reducer (partition) may generate up to topN/reducers urls, but due to partitioning not all reducers may be able to select this amount. This is an implementation consequence. However, if the number of domains (hosts, IPs) is much bigger than the number of reducers, this is not really an issue. And we get scalability in return.

    Example:

    Settings:

    • topN = 2500
    • generate.max.count = 100
    • reducers = 5

    Input for partition 1:

    • abc.com: 10 urls
    • klm.com: 100 urls
    • xyz.com: 1000 urls

    Output from partition 1:

    • abc.com: 10 urls
    • klm.com: 100 urls
    • xyz.com: 100 urls

    So for this partition, only 210 urls will be selected for fetching.

  • Things for future development

    How does this work together with fetching? Why write the generator mark back to the webtable and then select everything again, do the same partitioning and start fetching? Can’t we do the fetching in the Generator reducer directly? Possible cons:

    • If the job fails, you need to do the generate from scratch.

    Pros:

    • No need for another job (select, partition...)


    Is it really necessary to (re)process every url each generate?

    It sounds very inefficient to read all urls from the webtable for each generate job again, perform the checks above (calculate scoring), do sorting and partitioning... Especially when the number of urls >> topN. Sure, after fetching a batch, the webtable has changed (new urls, scoring changed) so if you really want to do things perfectly, you should check the webtable again completely... right...?

    • In production it seems generating is not that expensive because there is limited data on the input. (Just the fields necessary for determining the score and whether it should be fetched).


    Suggestions
    • Use region side filtering (where possible) to check the generated mark and fetchtime (see the sketch after this list).
    • If scoring isn’t important (?): keep track of url limits (topN and per domain (host, IP)) in the mapper. And stop reading new input records when topN/mappers have been selected. For domain and host mode (generate.max.count) this actually may work relatively fast since the map input is sorted by inverse url.

    Cons:

    • Scoring is not used for selection
    • Domains (hosts) at the start of a region (mapper input) have the highest chance to get selected.


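    For the region-side filtering suggestion, here is a rough sketch of what such a scan could look like with an HBase-backed store (assumption: the column family and qualifier names below are hypothetical, not the actual webtable schema):

        import org.apache.hadoop.hbase.client.Scan;
        import org.apache.hadoop.hbase.filter.CompareFilter.CompareOp;
        import org.apache.hadoop.hbase.filter.SingleColumnValueFilter;
        import org.apache.hadoop.hbase.util.Bytes;

        // Hypothetical sketch: push the generate-mark check down to the regions
        // so already-marked rows are filtered server side instead of in the mapper.
        public class RegionFilterSketch {

            public static Scan generateScan(byte[] batchId) {
                Scan scan = new Scan();
                SingleColumnValueFilter filter = new SingleColumnValueFilter(
                    Bytes.toBytes("mk"), Bytes.toBytes("_gnmrk_"),   // hypothetical family/qualifier
                    CompareOp.NOT_EQUAL, batchId);
                filter.setFilterIfMissing(false);  // rows without any mark still pass
                scan.setFilter(filter);
                return scan;
            }
        }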


    Fetch

    Fetches all urls which have been marked by the generator with a given batch ID (or optionally fetches all urls).


    Mapper

    Reads every url from the webtable.

    1. Check if the generator mark matches.
    2. If the url also has a fetcher mark, then skip it if the ‘continue’ argument is provided, else also refetch everything that has been fetched before in this batch.
    3. Output each url with a random int as key (shuffle all urls)
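
    Step 3, emitting each url under a random key, can be sketched like this (a hypothetical plain-Java rendering of the idea; the real mapper emits Hadoop writables):

        import java.util.Random;

        // Hypothetical sketch: keying each url by a random int shuffles the urls
        // across reducers so no reducer starts with a long run of one host.
        public class FetchShuffleSketch {

            public static void main(String[] args) {
                Random random = new Random();
                String[] urls = { "http://a.example.com/", "http://b.example.com/" };
                for (String url : urls) {
                    int key = random.nextInt();          // random key -> random order
                    System.out.println(key + "\t" + url);
                }
            }
        }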

    Note: we need to go through the whole webtable here again! Possible changes:

    • Use region side filters to check generator marks?
    • Let the generator reducer do the fetching? Cons:
      • No random ordering of selected urls
    • Put generator output in an intermediate location (file/table) and fetch from there (back to segments in Nutch 1.x)


    Partition

    Partition by host.

    Note: Why does the fetcher mapper use its own partitioner (by host)? The fetcher reducer supports queues per host, domain and IP...


    Reduce

    1. Puts the randomized urls in fetch queues (one queue per domain (host, IP) for politeness) and scans the queues for items to fetch.
    2. Fetch each url. Todo: How are redirects handled?
    3. Output success and failures to the webtable. Note: fetchTime is set to 'now'.
    4. If parsing is configured: parse the content (skipping unfetched items) using the ParseUtil class (also used by the ParseJob).

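    A toy sketch of the per-host queue structure from step 1 (hypothetical; the real FetcherJob reducer additionally enforces per-queue crawl delays and thread limits):

        import java.net.URL;
        import java.util.ArrayDeque;
        import java.util.HashMap;
        import java.util.Map;
        import java.util.Queue;

        // Hypothetical sketch: group incoming (randomized) urls into one queue
        // per host so fetch threads can stay polite per host.
        public class FetchQueuesSketch {

            public static void main(String[] args) throws Exception {
                Map<String, Queue<String>> queues = new HashMap<>();
                String[] urls = { "http://a.example.com/1", "http://b.example.com/1", "http://a.example.com/2" };
                for (String url : urls) {
                    String host = new URL(url).getHost();
                    queues.computeIfAbsent(host, h -> new ArrayDeque<>()).add(url);
                }
                // fetch threads would repeatedly scan these queues, taking at most
                // one in-flight url per host and respecting the crawl delay
                System.out.println(queues);
            }
        }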


    Parse

    Parses all webpages from a given batch id.


    Mapper

    1. Reads every webpage from the webtable.
    2. Check if the fetch mark matches.
    3. If this webpage has been parsed before (even in another batch) then skip it.
    4. Run parsers on the row.
    5. Output the row to the webtable.

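    The skip logic of steps 2–3 can be compressed into a few lines (hypothetical sketch; the real job runs the parsers via the ParseUtil class):

        // Hypothetical sketch of the ParserJob mapper's skip checks.
        public class ParseMapperSketch {

            static boolean shouldParse(String fetchMark, String batchId, boolean parsedBefore) {
                if (fetchMark == null || !fetchMark.equals(batchId)) return false; // 2. fetch mark must match
                if (parsedBefore) return false;                                    // 3. parsed in any earlier batch
                return true;                                                       // 4./5. run parsers, write row
            }

            public static void main(String[] args) {
                System.out.println(shouldParse("batch-1", "batch-1", false)); // true
                System.out.println(shouldParse("batch-1", "batch-2", false)); // false
            }
        }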


    DbUpdate

    Updates all rows with inlinks (backlinks), fetchtime and the correct score.


    Mapper

    1. Read every row from the webtable.
    2. Update the score for every outlink.
    3. Output every outlink with score and anchor (linktext).
    4. Output the rowkey itself with score.

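    One way to picture steps 2–4 (a toy sketch with hypothetical OPIC-style score splitting; in Nutch the actual scoring is pluggable via ScoringFilters):

        import java.util.LinkedHashMap;
        import java.util.Map;

        // Hypothetical sketch of the DbUpdaterJob mapper: emit every outlink with
        // a share of the row's score plus its anchor text, then the row itself.
        public class DbUpdateMapperSketch {

            public static void main(String[] args) {
                String url = "http://www.example.com/";
                float score = 1.0f;
                Map<String, String> outlinks = new LinkedHashMap<>(); // outlink url -> anchor text
                outlinks.put("http://www.example.com/about/", "About us");
                outlinks.put("http://www.example.com/news/", "News");

                float outScore = score / outlinks.size();  // hypothetical OPIC-style split
                for (Map.Entry<String, String> e : outlinks.entrySet()) {
                    System.out.println(e.getKey() + "\t" + outScore + "\t" + e.getValue()); // 2./3.
                }
                System.out.println(url + "\t" + score);    // 4. the row key itself with its score
            }
        }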

    Partition

    Partition by {url}. Sort by {url,score}. Group by {url}. This ensures the inlinks are sorted by score in the reducer.

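    This is the classic Hadoop secondary-sort pattern. Its effect can be shown in miniature (plain Java; the real job wires this up with a partitioner, a sort comparator and a grouping comparator):

        import java.util.ArrayList;
        import java.util.Arrays;
        import java.util.Comparator;
        import java.util.List;

        // Toy sketch of the secondary-sort effect: records sharing a url are
        // grouped into one reduce call and arrive ordered by score, best first.
        public class SecondarySortSketch {

            public static void main(String[] args) {
                List<String[]> records = new ArrayList<>();            // {url, score, inlink}
                records.add(new String[] { "com.example.www:http/", "0.2", "inlinkB" });
                records.add(new String[] { "com.example.www:http/", "0.8", "inlinkA" });

                // sort comparator: by url, then by score descending
                records.sort(Comparator
                    .comparing((String[] r) -> r[0])
                    .thenComparing(r -> Float.parseFloat(r[1]), Comparator.reverseOrder()));

                for (String[] r : records) System.out.println(Arrays.toString(r));
                // a grouping comparator on url alone would then hand both records
                // to a single reduce() call, already ordered by score
            }
        }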

    Reduce

    1. The key in the reducer is a row in the webtable and the reduce values are its inlinks.
    2. Update the fetchtime.
    3. Update the inlinks (capped by the property ‘db.update.max.inlinks’).
    4. Update the score for a row based on its inlinks.
    5. Output the row to the webtable.
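
    A last toy sketch for the inlink cap of step 3 (hypothetical; thanks to the sort order established above, keeping the first N inlinks keeps the best-scored ones):

        import java.util.List;

        // Hypothetical sketch of the DbUpdaterJob reducer's inlink cap.
        public class DbUpdateReducerSketch {

            public static void main(String[] args) {
                int maxInlinks = 3;  // db.update.max.inlinks
                List<String> inlinks = List.of("a", "b", "c", "d", "e"); // already sorted best-first
                List<String> kept = inlinks.subList(0, Math.min(maxInlinks, inlinks.size()));
                // 2. update fetchtime, 4. recompute the row score from the kept
                //    inlinks, 5. write the row back to the webtable
                System.out.println(kept); // [a, b, c]
            }
        }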