Nutch Development (Part 1)

Development environment
  • Linux, Ubuntu 20.04 LTS
  • IDEA
  • Nutch 1.18
  • Solr 8.11

Please credit the source when reposting!!! By 鸭梨的药丸哥

1. Importing the Nutch project into IDEA

To develop against Nutch it is best to download the Nutch source code as well; grab the source package from the official site.

Download link for version 1.18: https://www.apache.org/dyn/closer.lua/nutch/1.18/apache-nutch-1.18-src.tar.gz

I downloaded the source package onto a Linux machine: many Nutch commands have to run on Linux, so for convenience I also develop Nutch plugins there.

Before building the source, make sure ant is installed. On Ubuntu it can be installed with:

sudo apt-get update
sudo apt-get install ant
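Before running the build, a quick sanity check can confirm the required tools are on PATH (purely illustrative; the output depends on what is installed on your machine):

```shell
# Report whether ant and java are on PATH; count how many are missing.
missing=0
for tool in ant java; do
  if command -v "$tool" >/dev/null 2>&1; then
    echo "$tool found: $(command -v "$tool")"
  else
    echo "$tool not found"
    missing=$((missing + 1))
  fi
done
echo "$missing tool(s) missing"
```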

Build Nutch as an Eclipse project:

ant eclipse

Then import the project into IDEA as an Eclipse project; there are plenty of guides for this online. Import the Nutch source project normally, selecting the Eclipse project format during import.

2. Overview of the Nutch source tree

The directory layout produced by building the source differs slightly from that of the downloaded binary package:

build/    # generated by `ant eclipse`
conf/     # configuration files
docs/     # API documentation
ivy/      # files for the Ivy dependency manager
lib/      # placeholder folder for Hadoop native libraries (not downloaded automatically; they speed up data (de)compression)
src/      # source code

3. Nutch crawl steps

A full Nutch crawl consists of many separate steps:

  • injector -> generator -> fetcher -> parseSegment -> updateCrawlDB -> Invert links -> Index -> DeleteDuplicates -> IndexMerger
  1. Build the initial URL set

  2. Run inject to load the URL set into the crawldb

  3. Run generate to create a fetch list from the crawldb

  4. Run fetch to download the pages

  5. Run parse to parse the fetched pages

  6. Run updatedb to write the fetched page information back into the crawldb

  7. Repeat steps 3 to 6 until the preset crawl depth is reached (the "generate/fetch/update" cycle)

  8. Run invertlinks to update the linkdb from the contents of the segments

  9. Build the index with index (e.g. in Solr)
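The generate/fetch/parse/updatedb cycle above can be sketched as a small driver script. The one below is only a dry run that prints the commands it would execute; the paths, depth, and the `<segment>` placeholder are illustrative (a real driver would look up the timestamped segment directory that generate creates):

```shell
#!/bin/sh
# Dry-run sketch of the crawl cycle: print each command instead of executing it.
CRAWLDB=myNutch/crawldb      # illustrative path
SEGMENTS=myNutch/segments    # illustrative path
DEPTH=3                      # number of generate/fetch/parse/updatedb rounds

run() {
  # A real driver would execute "bin/nutch $*" here; we only print it.
  echo "bin/nutch $*"
}

i=1
while [ "$i" -le "$DEPTH" ]; do
  run generate "$CRAWLDB" "$SEGMENTS" -topN 100
  # generate creates a timestamped directory under $SEGMENTS; a real script
  # would pick the newest one, e.g. SEGMENT=$(ls -d "$SEGMENTS"/* | tail -1)
  run fetch "$SEGMENTS/<segment>" -threads 16
  run parse "$SEGMENTS/<segment>"
  run updatedb "$CRAWLDB" "$SEGMENTS/<segment>"
  i=$((i + 1))
done
run invertlinks myNutch/linkdb -dir "$SEGMENTS"
```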

Below is an architecture diagram drawn by the Nutch author. It reflects an older version, from before the full-text search functionality was split out of Nutch.

[figure: Nutch architecture diagram]

4. The entry-point classes

The main entry-point classes are listed below:

Operation   | Class in Nutch 1.x (i.e. trunk)     | Class in Nutch 2.x
------------|-------------------------------------|-------------------------------------
inject      | org.apache.nutch.crawl.Injector     | org.apache.nutch.crawl.InjectorJob
generate    | org.apache.nutch.crawl.Generator    | org.apache.nutch.crawl.GeneratorJob
fetch       | org.apache.nutch.fetcher.Fetcher    | org.apache.nutch.fetcher.FetcherJob
parse       | org.apache.nutch.parse.ParseSegment | org.apache.nutch.parse.ParserJob
updatedb    | org.apache.nutch.crawl.CrawlDb      | org.apache.nutch.crawl.DbUpdaterJob
invertlinks | org.apache.nutch.crawl.LinkDb       | ???

5. The Nutch shell script

Looking at Nutch's shell script shows that, at bottom, it simply dispatches to the concrete entry-point classes.

The excerpt below shows how each COMMAND is mapped to an entry-point class, with the remaining command-line arguments passed through to it:

# figure out which class to run
if [ "$COMMAND" = "crawl" ] ; then
  echo "Command $COMMAND is deprecated, please use bin/crawl instead"
  exit -1
elif [ "$COMMAND" = "inject" ] ; then
  CLASS=org.apache.nutch.crawl.Injector
elif [ "$COMMAND" = "generate" ] ; then
  CLASS=org.apache.nutch.crawl.Generator
elif [ "$COMMAND" = "freegen" ] ; then
  CLASS=org.apache.nutch.tools.FreeGenerator
elif [ "$COMMAND" = "fetch" ] ; then
  CLASS=org.apache.nutch.fetcher.Fetcher
elif [ "$COMMAND" = "parse" ] ; then
  CLASS=org.apache.nutch.parse.ParseSegment
elif [ "$COMMAND" = "readdb" ] ; then
  CLASS=org.apache.nutch.crawl.CrawlDbReader
elif [ "$COMMAND" = "mergedb" ] ; then
  CLASS=org.apache.nutch.crawl.CrawlDbMerger
elif [ "$COMMAND" = "readlinkdb" ] ; then
  CLASS=org.apache.nutch.crawl.LinkDbReader
elif [ "$COMMAND" = "readseg" ] ; then
  CLASS=org.apache.nutch.segment.SegmentReader
elif [ "$COMMAND" = "mergesegs" ] ; then
  CLASS=org.apache.nutch.segment.SegmentMerger
elif [ "$COMMAND" = "updatedb" ] ; then
  CLASS=org.apache.nutch.crawl.CrawlDb
elif [ "$COMMAND" = "invertlinks" ] ; then
  CLASS=org.apache.nutch.crawl.LinkDb
elif [ "$COMMAND" = "mergelinkdb" ] ; then
  CLASS=org.apache.nutch.crawl.LinkDbMerger
elif [ "$COMMAND" = "dump" ] ; then
  CLASS=org.apache.nutch.tools.FileDumper
elif [ "$COMMAND" = "commoncrawldump" ] ; then
  CLASS=org.apache.nutch.tools.CommonCrawlDataDumper
elif [ "$COMMAND" = "solrindex" ] ; then
  CLASS="org.apache.nutch.indexer.IndexingJob -D solr.server.url=$1"
  shift
elif [ "$COMMAND" = "index" ] ; then
  CLASS=org.apache.nutch.indexer.IndexingJob
elif [ "$COMMAND" = "solrdedup" ] ; then
  echo "Command $COMMAND is deprecated, please use dedup instead"
  exit -1
elif [ "$COMMAND" = "dedup" ] ; then
  CLASS=org.apache.nutch.crawl.DeduplicationJob
elif [ "$COMMAND" = "solrclean" ] ; then
  CLASS="org.apache.nutch.indexer.CleaningJob -D solr.server.url=$2 $1"
  shift; shift
elif [ "$COMMAND" = "clean" ] ; then
  CLASS=org.apache.nutch.indexer.CleaningJob
elif [ "$COMMAND" = "parsechecker" ] ; then
  CLASS=org.apache.nutch.parse.ParserChecker
elif [ "$COMMAND" = "indexchecker" ] ; then
  CLASS=org.apache.nutch.indexer.IndexingFiltersChecker
elif [ "$COMMAND" = "filterchecker" ] ; then
  CLASS=org.apache.nutch.net.URLFilterChecker
elif [ "$COMMAND" = "normalizerchecker" ] ; then
  CLASS=org.apache.nutch.net.URLNormalizerChecker
elif [ "$COMMAND" = "domainstats" ] ; then 
  CLASS=org.apache.nutch.util.domain.DomainStatistics
elif [ "$COMMAND" = "protocolstats" ] ; then
   CLASS=org.apache.nutch.util.ProtocolStatusStatistics
elif [ "$COMMAND" = "crawlcomplete" ] ; then
  CLASS=org.apache.nutch.util.CrawlCompletionStats
elif [ "$COMMAND" = "webgraph" ] ; then
  CLASS=org.apache.nutch.scoring.webgraph.WebGraph
elif [ "$COMMAND" = "linkrank" ] ; then
  CLASS=org.apache.nutch.scoring.webgraph.LinkRank
elif [ "$COMMAND" = "scoreupdater" ] ; then
  CLASS=org.apache.nutch.scoring.webgraph.ScoreUpdater
elif [ "$COMMAND" = "nodedumper" ] ; then
  CLASS=org.apache.nutch.scoring.webgraph.NodeDumper
elif [ "$COMMAND" = "plugin" ] ; then
  CLASS=org.apache.nutch.plugin.PluginRepository
elif [ "$COMMAND" = "junit" ] ; then
  CLASSPATH="$CLASSPATH:$NUTCH_HOME/test/classes/"
  if $local; then
    for f in "$NUTCH_HOME"/test/lib/*.jar; do
      CLASSPATH="${CLASSPATH}:$f";
    done
  fi
  CLASS=org.junit.runner.JUnitCore
elif [ "$COMMAND" = "startserver" ] ; then
  CLASS=org.apache.nutch.service.NutchServer
elif [ "$COMMAND" = "webapp" ] ; then
  CLASS=org.apache.nutch.webui.NutchUiServer
elif [ "$COMMAND" = "warc" ] ; then
  CLASS=org.apache.nutch.tools.warc.WARCExporter
elif [ "$COMMAND" = "updatehostdb" ] ; then
  CLASS=org.apache.nutch.hostdb.UpdateHostDb
elif [ "$COMMAND" = "readhostdb" ] ; then
  CLASS=org.apache.nutch.hostdb.ReadHostDb
elif [ "$COMMAND" = "sitemap" ] ; then
  CLASS=org.apache.nutch.util.SitemapProcessor
elif [ "$COMMAND" = "showproperties" ] ; then
  CLASS=org.apache.nutch.tools.ShowProperties
else
  CLASS=$COMMAND
fi

6. Running the Injector

The main function for inject lives in the Injector class of the org.apache.nutch.crawl package.

6.1 Configuration

To run inject from the IDE, first add a plugin.folders property to apache-nutch-1.18/conf/nutch-site.xml to override the default relative path, because the working directory when running from source differs from the one the nutch script uses.

<property>  
  <name>plugin.folders</name>  
  <value>/home/liangwy/IdeaProjects/apache-nutch-1.18/src/plugin</value>  
  <description>Directories where nutch plugins are located.  Each  
  element may be a relative or absolute path.  If absolute, it is used  
  as is.  If relative, it is searched for on the classpath.</description>  
</property>  
6.2 Create a URL list

mkdir urls
touch urls/seeds.txt
vim urls/seeds.txt
# then enter the first batch of URLs to crawl, one per line
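A non-interactive equivalent of the steps above; the URL is only an example seed, replace it with the sites you actually want to crawl:

```shell
# Create the seed directory and write one seed URL per line.
mkdir -p urls
cat > urls/seeds.txt <<'EOF'
https://nutch.apache.org/
EOF
```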
6.3 Creating a run configuration in IDEA

From the main menu choose Run -> Edit Configurations, click +, and create a new Application:

  • Name : Injector
  • Main Class : org.apache.nutch.crawl.Injector (the 1.x entry class; check the source for your version, in 2.x it is called InjectorJob)
  • VM options : -Dhadoop.log.dir=logs -Dhadoop.log.file=hadoop.log
  • Program arguments : /home/User/IdeaProjects/apache-nutch-1.18/myNutch/crawldb /home/User/apache-nutch-1.18/urls (the directory holding the seed file seeds.txt)

Note: the program arguments are the same ones you would pass to the nutch script.

6.4 Command-line equivalent

Running this is equivalent to invoking the nutch command in the bin/ directory of the binary distribution:

./nutch inject /home/User/IdeaProjects/apache-nutch-1.18/myNutch/crawldb /home/liangwy/apache-nutch-1.18/urls

7. Analysis of the Injector main function

The main function of Injector is:

public static void main(String[] args) throws Exception {
    int res = ToolRunner.run(NutchConfiguration.create(), new Injector(), args);
    System.exit(res);
}

Injector is launched through ToolRunner. Stepping into ToolRunner's run function shows that the call is ultimately delegated to the tool's own run method.

Method parameters:

  • Configuration conf # the Nutch configuration
  • Tool tool # the tool to run (e.g. Injector, Generator)
  • String[] args # command-line arguments passed on to the tool

public static int run(Configuration conf, Tool tool, String[] args) throws Exception {
    if (CallerContext.getCurrent() == null) {
        CallerContext ctx = (new Builder("CLI")).build();
        CallerContext.setCurrent(ctx);
    }

    if (conf == null) {
        conf = new Configuration();
    }
    // parse generic options (-D, -conf, ...) from the command line into conf
    GenericOptionsParser parser = new GenericOptionsParser(conf, args);
    tool.setConf(conf);
    String[] toolArgs = parser.getRemainingArgs();
    // the actual work is still done by the tool's own run method
    return tool.run(toolArgs);
}

8. Running the Generator

8.1 Creating a run configuration in IDEA

From the main menu choose Run -> Edit Configurations, click +, and create a new Application:

  • Name : Generator
  • Main Class : org.apache.nutch.crawl.Generator
  • Program arguments : /home/User/IdeaProjects/apache-nutch-1.18/myNutch/crawldb /home/User/IdeaProjects/apache-nutch-1.18/myNutch/segments -topN 100

Note: the program arguments are the same ones you would pass to the nutch script.

8.2 Command-line equivalent

Running this is equivalent to invoking the nutch command in the bin/ directory of the binary distribution:

./nutch generate /home/liangwy/IdeaProjects/apache-nutch-1.18/myNutch/crawldb /home/liangwy/IdeaProjects/apache-nutch-1.18/myNutch/segments -topN 100

9. Running the Fetcher

9.1 Creating a run configuration in IDEA

From the main menu choose Run -> Edit Configurations, click +, and create a new Application:

  • Name : Fetcher
  • Main Class : org.apache.nutch.fetcher.Fetcher
  • Program arguments : /home/User/IdeaProjects/apache-nutch-1.18/myNutch/segments/20220114175955 -threads 16

Note: the program arguments are the same ones you would pass to the nutch script.

9.2 Error analysis

If http.agent.name is not set (it can be configured in conf/nutch-site.xml), the Fetcher fails with:

Fetcher: No agents listed in 'http.agent.name' property.
Fetcher: java.lang.IllegalArgumentException: Fetcher: No agents listed in 'http.agent.name' property.
at org.apache.nutch.fetcher.Fetcher.checkConfiguration(Fetcher.java:563)
at org.apache.nutch.fetcher.Fetcher.fetch(Fetcher.java:431)
at org.apache.nutch.fetcher.Fetcher.run(Fetcher.java:545)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76)
at org.apache.nutch.fetcher.Fetcher.main(Fetcher.java:518)

9.3 Configuring http.agent.name

Add the following properties to conf/nutch-site.xml:

<property>
    <name>http.agent.name</name>
    <value>Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.55 Safari/537.36 Edg/96.0.1054.43</value>
    <description>HTTP 'User-Agent' request header. MUST NOT be empty -
      please set this to a single word uniquely related to your organization.
      NOTE: You should also check other related properties:
      http.robots.agents
      http.agent.description
      http.agent.url
      http.agent.email
      http.agent.version
      and set their values appropriately.
    </description>
  </property>
  <property>
    <name>http.robots.agents</name>
    <value>Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.55 Safari/537.36 Edg/96.0.1054.43,*</value>
  </property>
9.4 Command-line equivalent

Running this is equivalent to invoking the nutch command in the bin/ directory of the binary distribution:

./nutch fetch /home/liangwy/IdeaProjects/apache-nutch-1.18/myNutch/segments/20220114175955 -threads 16

10. Running ParseSegment

10.1 Creating a run configuration in IDEA

From the main menu choose Run -> Edit Configurations, click +, and create a new Application:

  • Name : ParseSegment
  • Main Class : org.apache.nutch.parse.ParseSegment
  • Program arguments : /home/User/IdeaProjects/apache-nutch-1.18/myNutch/segments/20220114175955

Note: the program arguments are the same ones you would pass to the nutch script.

10.2 Command-line equivalent

Running this is equivalent to invoking the nutch command in the bin/ directory of the binary distribution:

./nutch parse /home/liangwy/IdeaProjects/apache-nutch-1.18/myNutch/segments/20220114175955

11. Running CrawlDb (updatedb)

11.1 Creating a run configuration in IDEA

From the main menu choose Run -> Edit Configurations, click +, and create a new Application:

  • Name : CrawlDb
  • Main Class : org.apache.nutch.crawl.CrawlDb
  • Program arguments : /home/liangwy/IdeaProjects/apache-nutch-1.18/myNutch/crawldb/ -dir /home/liangwy/IdeaProjects/apache-nutch-1.18/myNutch/segments

Note: the program arguments are the same ones you would pass to the nutch script.

11.2 Command-line equivalent

Running this is equivalent to invoking the nutch command in the bin/ directory of the binary distribution:

./nutch updatedb /home/liangwy/IdeaProjects/apache-nutch-1.18/myNutch/crawldb/ -dir /home/liangwy/IdeaProjects/apache-nutch-1.18/myNutch/segments

12. Running LinkDb (invertlinks)

12.1 Creating a run configuration in IDEA

From the main menu choose Run -> Edit Configurations, click +, and create a new Application:

  • Name : LinkDb
  • Main Class : org.apache.nutch.crawl.LinkDb
  • Program arguments : /home/liangwy/IdeaProjects/apache-nutch-1.18/myNutch/linkdb -dir /home/liangwy/IdeaProjects/apache-nutch-1.18/myNutch/segments/

Note: the program arguments are the same ones you would pass to the nutch script.

12.2 Command-line equivalent

Running this is equivalent to invoking the nutch command in the bin/ directory of the binary distribution:

./nutch invertlinks /home/liangwy/IdeaProjects/apache-nutch-1.18/myNutch/linkdb -dir /home/liangwy/IdeaProjects/apache-nutch-1.18/myNutch/segments/

Next chapter

The next chapter will show how to tie all of these steps together.
