Nutch Tutorial


English original: NutchTutorial

Introduction

Apache Nutch is an open source Web crawler written in Java. By using it, we can find Web page hyperlinks in an automated manner, reduce a lot of maintenance work (for example, checking broken links), and create a copy of all the visited pages for searching over. That’s where Apache Solr comes in. Solr is an open source full text search framework; with Solr we can search the pages visited by Nutch. Luckily, integration between Nutch and Solr is pretty straightforward, as explained below.

Apache Nutch supports Solr out of the box, greatly simplifying Nutch-Solr integration. It also removes the legacy dependence upon both Apache Tomcat for running the old Nutch Web Application and upon Apache Lucene for indexing. Just download a binary release from here.


Steps

This tutorial describes the installation and use of Nutch 1.x (current release is 1.5.1). For instructions on how to compile and set up Nutch 2.x, see Nutch2Tutorial.

1. Setup Nutch from binary distribution

  • Download a binary package (apache-nutch-1.X-bin.zip) from here.

  • Unzip your binary Nutch package. There should be a folder apache-nutch-1.X.

  • cd apache-nutch-1.X/

From now on, we are going to use ${NUTCH_RUNTIME_HOME} to refer to the current directory (apache-nutch-1.X/).
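
If you like, you can set an actual shell variable to match this placeholder (the variable name here just mirrors the tutorial's notation and is not read by Nutch itself):

# optional: record the runtime directory in a shell variable
export NUTCH_RUNTIME_HOME="$(pwd)"
echo ${NUTCH_RUNTIME_HOME}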

Set up from the source distribution

Advanced users may also use the source distribution:

  • Download a source package (apache-nutch-1.X-src.zip)

  • Unzip
  • cd apache-nutch-1.X/

  • Run ant in this folder (cf. RunNutchInEclipse)

  • Now there is a directory runtime/local which contains a ready-to-use Nutch installation.

When the source distribution is used, ${NUTCH_RUNTIME_HOME} refers to apache-nutch-1.X/runtime/local/. Note that

  • config files should be modified in apache-nutch-1.X/runtime/local/conf/

  • ant clean will remove this directory (keep copies of modified config files, for example as sketched below)
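
A minimal backup sketch, assuming the default source-distribution layout and that only nutch-site.xml and regex-urlfilter.txt were modified (adapt the file list to whatever you actually changed):

# back up modified config files before running "ant clean"
mkdir -p ~/nutch-conf-backup
cp runtime/local/conf/nutch-site.xml runtime/local/conf/regex-urlfilter.txt ~/nutch-conf-backup/
ant clean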

2. Verify your Nutch installation

  • run "bin/nutch" - You can confirm a correct installation if you seeing the following:

Usage: nutch [-core] COMMAND

Some troubleshooting tips:

  • Run the following command if you are seeing "Permission denied":

chmod +x bin/nutch
  • Set JAVA_HOME if you are seeing a message that JAVA_HOME is not set. On a Mac, you can run the following command or add it to ~/.bashrc:

export JAVA_HOME=/System/Library/Frameworks/JavaVM.framework/Versions/1.6/Home


3. Crawl your first website

  • Add your agent name in the value field of the http.agent.name property in conf/nutch-site.xml, for example:

<property>
 <name>http.agent.name</name>
 <value>My Nutch Spider</value>
</property>
  • mkdir -p urls

  • cd urls

  • touch seed.txt to create a text file seed.txt under urls/ with the following content (one URL per line for each site you want Nutch to crawl).

http://nutch.apache.org/
  • Edit the file conf/regex-urlfilter.txt and replace

# accept anything else
+.

with a regular expression matching the domain you wish to crawl. For example, if you wished to limit the crawl to the nutch.apache.org domain, the line should read:

 +^http://([a-z0-9]*\.)*nutch.apache.org/

This will include any URL in the domain nutch.apache.org.

3.1 Using the Crawl Command

Now we are ready to initiate a crawl. Use the following parameters:

  • -dir dir names the directory to put the crawl in.

  • -threads threads determines the number of threads that will fetch in parallel.

  • -depth depth indicates the link depth from the root page that should be crawled.

  • -topN N determines the maximum number of pages that will be retrieved at each level up to the depth.

  • Run the following command:

bin/nutch crawl urls -dir crawl -depth 3 -topN 5
  • Now you should be able to see the following directories created:

crawl/crawldb
crawl/linkdb
crawl/segments

NOTE: If you have a Solr core already set up and wish to index to it, you are required to add the -solr <solrUrl> parameter to your crawl command, e.g.

bin/nutch crawl urls -solr http://localhost:8983/solr/ -depth 3 -topN 5

If not then please skip to here for how to set up your Solr instance and index your crawl data.

Typically one starts testing one's configuration by crawling at shallow depths, sharply limiting the number of pages fetched at each level (-topN), and watching the output to check that desired pages are fetched and undesirable pages are not. Once one is confident of the configuration, then an appropriate depth for a full crawl is around 10. The number of pages per level (-topN) for a full crawl can be from tens of thousands to millions, depending on your resources.
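
As a concrete illustration (the directory name and the numbers below are arbitrary test values, not settings prescribed by this tutorial), a shallow test crawl followed by a full crawl might look like:

# quick configuration test: shallow depth, few pages per level
bin/nutch crawl urls -dir crawl_test -depth 2 -topN 50
# full crawl once the configuration looks right
bin/nutch crawl urls -dir crawl -depth 10 -topN 50000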


3.2 Using Individual Commands for Whole-Web Crawling

NOTE: If you previously modified the file conf/regex-urlfilter.txt as covered here, you will need to change it back.

Whole-Web crawling is designed to handle very large crawls which may take weeks to complete, running on multiple machines. This also permits more control over the crawl process, and incremental crawling. It is important to note that whole Web crawling does not necessarily mean crawling the entire World Wide Web. We can limit a whole Web crawl to just a list of the URLs we want to crawl. This is done by using a filter just like the one we used when we ran the crawl command (above).
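
For instance, to restrict a whole-Web crawl to a couple of sites, conf/regex-urlfilter.txt might contain entries like the following (the second domain is only a placeholder for whatever else is on your seed list; URLs matching none of the + rules are not accepted):

+^http://([a-z0-9]*\.)*nutch.apache.org/
+^http://([a-z0-9]*\.)*wiki.apache.org/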


Step-by-Step: Concepts

Nutch data is composed of:

  1. The crawl database, or crawldb. This contains information about every URL known to Nutch, including whether it was fetched, and, if so, when.
  2. The link database, or linkdb. This contains the list of known links to each URL, including both the source URL and anchor text of the link.
  3. A set of segments. Each segment is a set of URLs that are fetched as a unit. Segments are directories with the following subdirectories:
    • crawl_generate names a set of URLs to be fetched

    • crawl_fetch contains the status of fetching each URL

    • content contains the raw content retrieved from each URL

    • parse_text contains the parsed text of each URL

    • parse_data contains outlinks and metadata parsed from each URL

    • crawl_parse contains the outlink URLs, used to update the crawldb
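
Once a crawl has produced these structures, they can be examined with Nutch's reader tools; a quick sketch (the paths assume the crawl/ layout created earlier, and the dump directory name is arbitrary):

# list the segments and how many URLs each contains
bin/nutch readseg -list -dir crawl/segments
# dump the linkdb (incoming links and anchor text) to a plain-text directory
bin/nutch readlinkdb crawl/linkdb -dump linkdb_dump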

Step-by-Step: Seeding the crawldb with a list of URLs
Option 1: Bootstrapping from the DMOZ database.

The injector adds URLs to the crawldb. Let's inject URLs from the DMOZ Open Directory. First we must download and uncompress the file listing all of the DMOZ pages. (This is a 200+ MB file, so this will take a few minutes.)

wget http://rdf.dmoz.org/rdf/content.rdf.u8.gz
gunzip content.rdf.u8.gz

Next we select a random subset of these pages. (We use a random subset so that everyone who runs this tutorial doesn't hammer the same sites.) DMOZ contains around three million URLs. We select one out of every 5,000, so that we end up with around 1,000 URLs:

mkdir dmoz
bin/nutch org.apache.nutch.tools.DmozParser content.rdf.u8 -subset 5000 > dmoz/urls

The parser also takes a few minutes, as it must parse the full file. Finally, we initialize the crawldb with the selected URLs.

bin/nutch inject crawl/crawldb dmoz

Now we have a Web database with around 1,000 as-yet unfetched URLs in it.
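
A quick way to confirm this is the crawldb reader's statistics output (a sketch, run from ${NUTCH_RUNTIME_HOME}; the -stats report includes the number of as-yet-unfetched URLs):

bin/nutch readdb crawl/crawldb -stats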

Option 2. Bootstrapping from an initial seed list.

This option shadows the creation of the seed list as covered here.

bin/nutch inject crawl/crawldb urls


Step-by-Step: Fetching

To fetch, we first generate a fetch list from the database:

bin/nutch generate crawl/crawldb crawl/segments

This generates a fetch list for all of the pages due to be fetched. The fetch list is placed in a newly created segment directory. The segment directory is named by the time it's created. We save the name of this segment in the shell variable s1:

s1=`ls -d crawl/segments/2* | tail -1`
echo $s1

Now we run the fetcher on this segment with:

bin/nutch fetch $s1

Then we parse the entries:

bin/nutch parse $s1

When this is complete, we update the database with the results of the fetch:

bin/nutch updatedb crawl/crawldb $s1

Now the database contains updated entries for all initial pages, as well as new entries that correspond to newly discovered pages linked from the initial set.

Now we generate and fetch a new segment containing the top-scoring 1,000 pages:

bin/nutch generate crawl/crawldb crawl/segments -topN 1000
s2=`ls -d crawl/segments/2* | tail -1`
echo $s2

bin/nutch fetch $s2
bin/nutch parse $s2
bin/nutch updatedb crawl/crawldb $s2

Let's fetch one more round:

bin/nutch generate crawl/crawldb crawl/segments -topN 1000
s3=`ls -d crawl/segments/2* | tail -1`
echo $s3

bin/nutch fetch $s3
bin/nutch parse $s3
bin/nutch updatedb crawl/crawldb $s3

By this point we've fetched a few thousand pages. Let's index them!


Step-by-Step: Invertlinks

Before indexing we first invert all of the links, so that we may index incoming anchor text with the pages.

bin/nutch invertlinks crawl/linkdb -dir crawl/segments

We are now ready to search with Apache Solr.

4. Setup Solr for search

  • download binary file from here

  • unzip to $HOME/apache-solr-3.X; we will now refer to this as ${APACHE_SOLR_HOME}

  • cd ${APACHE_SOLR_HOME}/example

  • java -jar start.jar

5. Verify Solr installation

After you have started the Solr admin console, you should be able to access the following links:

http://localhost:8983/solr/admin/
http://localhost:8983/solr/admin/stats.jsp
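
If you prefer a command-line check, fetching the same pages with curl (assumed to be installed) should return a page rather than a connection error:

curl -s http://localhost:8983/solr/admin/ | head -n 20
curl -s http://localhost:8983/solr/admin/stats.jsp | head -n 20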


6. Integrate Solr with Nutch

We have both Nutch and Solr installed and set up correctly, and Nutch has already created crawl data from the seed URL(s). Below are the steps to delegate searching to Solr so that the crawled links become searchable:

  • cp ${NUTCH_RUNTIME_HOME}/conf/schema.xml ${APACHE_SOLR_HOME}/example/solr/conf/

  • restart Solr with the command “java -jar start.jar” under ${APACHE_SOLR_HOME}/example

  • run the Solr Index command:

bin/nutch solrindex http://127.0.0.1:8983/solr/ crawl/crawldb -linkdb crawl/linkdb crawl/segments/*

The call signature for running the solrindex has changed. The linkdb is now optional, so you need to denote it with a "-linkdb" flag on the command line.
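
Because the linkdb is optional, a variant of the command above that skips it would simply leave out the -linkdb flag (a sketch based on the optional flag just described; incoming anchor text will then not be indexed):

bin/nutch solrindex http://127.0.0.1:8983/solr/ crawl/crawldb crawl/segments/*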

This will send all crawl data to Solr for indexing. For more information please see bin/nutch solrindex

If all has gone to plan, we are now ready to search with http://localhost:8983/solr/admin/. If you want to see the raw HTML indexed by Solr, change the content field definition in schema.xml to:

<field name="content" type="text" stored="true" indexed="true"/>
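
Once indexing has finished, a quick query from the command line might look like this (a sketch: the query term, field list and JSON response writer are just illustrative choices against the standard select handler):

curl 'http://localhost:8983/solr/select?q=nutch&fl=url,title&wt=json'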

