Compiling and Running Nutch in Eclipse

zhangxiang_nuaa_hust

Article link: https://blog.csdn.net/zhangxiang_nuaa_hust/article/details/83293538

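1. Download Nutch and extract it to a directory, preferably close to the root of a drive.

2. Edit two files in the Nutch\conf directory, nutch-site.xml and crawl-urlfilter.txt, as follows:

(1) nutch-site.xml:

Insert the following between <configuration> and </configuration>: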

<property>
  <name>http.agent.name</name>
  <value>Jennifer</value>
  <description>HTTP 'User-Agent' request header. MUST NOT be empty -
  please set this to a single word uniquely related to your organization.
  NOTE: You should also check other related properties:
    http.robots.agents
    http.agent.description
    http.agent.url
    http.agent.email
    http.agent.version
  and set their values appropriately.
  </description>
</property>

<property>
  <name>http.agent.description</name>
  <value>Jennifer</value>
  <description>Further description of our bot - this text is used in
  the User-Agent header. It appears in parenthesis after the agent name.
  </description>
</property>

<property>
  <name>http.agent.url</name>
  <value>Jennifer</value>
  <description>A URL to advertise in the User-Agent header. This will
  appear in parenthesis after the agent name. Custom dictates that this
  should be a URL of a page explaining the purpose and behavior of this
  crawler.
  </description>
</property>

<property>
  <name>http.agent.email</name>
  <value>Jennifer</value>
  <description>An email address to advertise in the HTTP 'From' request
  header and User-Agent header. A good practice is to mangle this
  address (e.g. 'info at example dot com') to avoid spamming.
  </description>
</property>

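Then change the text inside each <value></value>. Each block must be a complete <property>...</property> element. These settings exist because Nutch honors the robots protocol: when it fetches pages, it sends this identifying information to the site being crawled so that the site can recognize the crawler. You can set them to any values you like, as long as http.agent.name is not empty.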
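Before starting a crawl, it can help to confirm that the edited file is well-formed XML and that http.agent.name really has a value. The following is only a minimal sketch using the JDK's built-in DOM parser; the class name AgentNameCheck and the inline sample document are made up for illustration and are not part of Nutch.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper, not part of Nutch: looks up a property in
// configuration XML shaped like nutch-site.xml.
class AgentNameCheck {

    // Returns the trimmed <value> of the named <property>, or null if absent.
    // Throws IllegalStateException if the XML is not well-formed.
    static String propertyValue(String xml, String name) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            NodeList props = doc.getElementsByTagName("property");
            for (int i = 0; i < props.getLength(); i++) {
                Element p = (Element) props.item(i);
                NodeList names = p.getElementsByTagName("name");
                NodeList values = p.getElementsByTagName("value");
                if (names.getLength() > 0 && values.getLength() > 0
                        && name.equals(names.item(0).getTextContent().trim())) {
                    return values.item(0).getTextContent().trim();
                }
            }
            return null;
        } catch (Exception e) {
            throw new IllegalStateException("configuration is not well-formed XML", e);
        }
    }

    public static void main(String[] args) {
        // Inline sample; in practice the content of conf/nutch-site.xml would go here.
        String xml = "<configuration><property>"
                + "<name>http.agent.name</name><value>Jennifer</value>"
                + "</property></configuration>";
        String agent = propertyValue(xml, "http.agent.name");
        System.out.println(agent == null || agent.isEmpty()
                ? "http.agent.name is missing or empty"
                : "http.agent.name = " + agent);
    }
}
```

A <name> that is not wrapped in a <property> element is simply not found by this check, which is the kind of mistake it is meant to surface.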

(2) crawl-urlfilter.txt:

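Find the line "+^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/" and delete "MY.DOMAIN.NAME/" from it. The remaining rule, "+^http://([a-z0-9]*\.)*", no longer restricts the crawl to a single domain.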
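To see what this edit changes, the rule can be tried against a few URLs with plain java.util.regex. This is a rough sketch, assuming that a "+" line is an accept rule whose pattern only needs to be found somewhere in the URL; the class UrlFilterSketch is hypothetical and only imitates that one rule, not the whole filter file.

```java
import java.util.regex.Pattern;

// Hypothetical stand-in for a single "+pattern" line of crawl-urlfilter.txt.
class UrlFilterSketch {

    // Returns true if the accept rule's pattern is found in the URL.
    // Assumption: find() semantics, i.e. the pattern need not span the whole URL.
    static boolean accepts(String rule, String url) {
        if (!rule.startsWith("+")) {
            throw new IllegalArgumentException("only '+' (accept) rules are handled here");
        }
        return Pattern.compile(rule.substring(1)).matcher(url).find();
    }

    public static void main(String[] args) {
        String before = "+^http://([a-z0-9]*\\.)*MY.DOMAIN.NAME/"; // shipped default
        String after = "+^http://([a-z0-9]*\\.)*";                 // after the edit

        String url = "http://www.example.com/index.html";
        System.out.println(accepts(before, url)); // false: host is not MY.DOMAIN.NAME
        System.out.println(accepts(after, url));  // true: any http host passes
        System.out.println(accepts(after, "https://www.example.com/")); // false: rule only covers http://
    }
}
```

The last line is worth noting: seed URLs that start with https:// are still rejected by this rule.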

To make later steps easier, copy the conf folder once these changes are done and keep the copy somewhere else on disk.

3. Download two jar files from the following two pages:

http://nutch.cvs.sourceforge.net/nutch/nutch/src/plugin/parse-mp3/lib/

http://nutch.cvs.sourceforge.net/nutch/nutch/src/plugin/parse-rtf/lib/

The files are jid3lib-0.5.1.jar and rtf-parser.jar. Copy them into Nutch\src\plugin\parse-mp3\lib and Nutch\src\plugin\parse-rtf\lib respectively.

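4. With the preparation done, Nutch can be set up in Eclipse. Open Eclipse and create a Java project as follows:

File > New > Project > Java project. Give the project a name, select "Create project from existing source", and point it at your Nutch directory.

5. Click Next to reach the Java build settings. On the first tab, Source, select conf. Three links appear below it; choose the third one, "Add project ‘Nutch’ to build path". This puts conf on the classpath. Note: "third" refers to the third link, not to the third tab at the top.

6. Next, set the Default output folder. It must be Nutch/conf; otherwise crawl-urlfilter.txt cannot be found and the crawl does not run. This was the problem I had been unable to solve for a long time, and it shows up as the following messages: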

Generator: 0 records selected for fetching, exiting ...

Stopping at depth=0 - no more URLs to fetch.

No URLs to fetch - check your seed list and URL filters.
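If these messages appear, one quick diagnostic is to ask the class loader whether it can see the configuration files at all. The sketch below assumes the files are looked up as classpath resources, which is what the symptom above suggests; ClasspathCheck is a hypothetical helper class and should be run from the same Eclipse project so that it uses the same classpath.

```java
// Hypothetical diagnostic, not part of Nutch: reports whether a file
// is visible as a classpath resource.
class ClasspathCheck {

    // Returns true if the named resource can be located on the classpath.
    static boolean onClasspath(String name) {
        return ClasspathCheck.class.getClassLoader().getResource(name) != null;
    }

    public static void main(String[] args) {
        String[] names = { "crawl-urlfilter.txt", "nutch-site.xml" };
        for (String name : names) {
            System.out.println(name + (onClasspath(name) ? " found" : " NOT found") + " on classpath");
        }
    }
}
```

If crawl-urlfilter.txt is reported as not found, recheck the source folder and Default output folder settings from steps 5 and 6.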

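7. Click Finish. The initial setup is now complete. Click Run, select Java Application, and click OK so that Eclipse searches the project for a main class. Select "Crawl - org.apache.nutch.crawl" and click OK.

8. Look at the Nutch\conf directory again: its contents have changed, so restore the original contents from the copy of conf saved in step 2. (The newly generated files do not affect later crawls, so you could simply copy the original contents back over them. To keep the folder tidy and easy to navigate, though, I recommend deleting everything that was generated first and then copying the original contents back. One more caveat: conf must not contain an org folder. If it does, delete it, otherwise indexing is affected.) Then create a folder named myURL under the nutch directory, create a text file url.txt in it, enter the start URLs for the crawl, and save and close the file.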
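The seed list can also be created programmatically rather than by hand. The following is a minimal sketch with java.nio.file; the class SeedListSketch and the example URL are placeholders, and only the names myURL and url.txt come from the step above. It writes one URL per line.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper: writes the seed URL list used in step 8.
class SeedListSketch {

    // Creates <nutchDir>/myURL/url.txt with one URL per line and returns its path.
    static Path writeSeeds(Path nutchDir, List<String> urls) {
        try {
            Path dir = Files.createDirectories(nutchDir.resolve("myURL"));
            Path file = dir.resolve("url.txt");
            Files.write(file, urls, StandardCharsets.UTF_8);
            return file;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Assumption: the program is started from the Nutch directory.
        Path file = writeSeeds(Paths.get("."), Arrays.asList("http://www.example.com/"));
        System.out.println("Seed list written to " + file.toAbsolutePath());
    }
}
```

Keep in mind that every seed URL still has to pass the rules in crawl-urlfilter.txt, or nothing is selected for fetching.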

9. Choose Run > Open Run Dialog from the menu, select the Java Application branch, and open the Arguments tab.

In Program arguments, enter the crawl command, for example: myURL -dir myPages -depth 2 -topN 50

In VM arguments, enter: -Dhadoop.log.dir=logs -Dhadoop.log.file=hadoop.log
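The two fields reach the program in different ways: Program arguments arrive in the args array of main, while each -Dname=value in VM arguments becomes a JVM system property. The sketch below only illustrates that distinction; LaunchArgsSketch is a hypothetical class and does not call into Nutch.

```java
import java.util.Arrays;

// Hypothetical illustration of program arguments versus VM arguments.
class LaunchArgsSketch {

    // Summarizes what a main class would see for the given program arguments
    // and the current values of the two logging system properties.
    static String describe(String[] args) {
        return "program args = " + Arrays.toString(args)
                + ", hadoop.log.dir = " + System.getProperty("hadoop.log.dir")
                + ", hadoop.log.file = " + System.getProperty("hadoop.log.file");
    }

    public static void main(String[] args) {
        // Run with the same Program/VM arguments as in step 9 to see both kinds.
        System.out.println(describe(args));
    }
}
```

Running it with the settings from step 9 shows the crawl options in the args list and the two log settings as properties; without the VM arguments, both properties print as null.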

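When you are done, click Apply. If you have not yet created the seed URLs for the crawl, click Close, create them, and then click the Run button on the toolbar; the crawl output appears in the Console view. If the seed URLs already exist, click Run directly and the output appears in the Console view in the same way.

That completes the setup. I would still recommend getting familiar with running Nutch from a shell before configuring it in Eclipse. If anything here is wrong, corrections are welcome; I would also be glad to learn more about Nutch together with you.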