[Nutch] Configuring Nutch on a Single Linux Machine

风声2012

Article link: https://blog.csdn.net/zklth/article/details/5618948

1. Environment

Operating system: Red Hat Linux 9

Nutch version: nutch-0.9, download: http://apache.etoak.com/lucene/nutch/

JDK version: JDK 1.6

Apache Tomcat version: apache-tomcat-6.0.18, download:

http://apache.etoak.com/tomcat/tomcat-6/v6.0.18/bin/apache-tomcat-6.0.18.tar.gz

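2. Prerequisites

2.1 Install JDK 1.6

First download the JDK installer jdk-1_6_0_13-linux-i586-rpm.bin.

Step 1: # chmod +x jdk-1_6_0_13-linux-i586-rpm.bin (make the installer executable)

Step 2: # ./jdk-1_6_0_13-linux-i586-rpm.bin (unpack the RPM package)

Step 3: # rpm -ivh jdk-1_6_0_13-linux-i586.rpm (install the JDK)

After installation the JDK is located under /usr/java/ by default.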

Step 4: Configure the Java environment variables.

Set the environment variables in /etc/profile:

[root@red-hat-9 root]# vi /etc/profile

Add the following lines:

JAVA_HOME=/usr/java/jdk1.6.0_13

export JAVA_HOME

CLASSPATH=.:$JAVA_HOME/lib:$JAVA_HOME/jre/lib (note that the separator is a colon)

export CLASSPATH

PATH=$JAVA_HOME/bin:$JAVA_HOME/jre/bin:$PATH

[root@red-hat-9 root]# chmod +x /etc/profile (grant execute permission)

[root@red-hat-9 root]# source /etc/profile (the settings take effect from here on)
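As a quick sanity check after sourcing /etc/profile, a sketch like the following can confirm that the JDK's bin directory really ended up on PATH. The helper name path_contains is hypothetical; it is not part of the JDK or of Red Hat's tooling.

```shell
# path_contains LIST DIR
# Succeeds (exit status 0) if DIR is one of the colon-separated entries in LIST.
# Hypothetical helper, for illustration only.
path_contains() {
    case ":$1:" in
        *":$2:"*) return 0 ;;
        *)        return 1 ;;
    esac
}

# Fall back to the install location used in this article if JAVA_HOME is unset.
JAVA_HOME=${JAVA_HOME:-/usr/java/jdk1.6.0_13}

if path_contains "$PATH" "$JAVA_HOME/bin"; then
    echo "JDK bin directory is on PATH"
else
    echo "JDK bin directory is NOT on PATH; re-check /etc/profile"
fi
```

Wrapping the list in colons on both sides lets one pattern cover the first, middle and last entries of PATH alike.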

2.2 Install Tomcat

Step 1: Set an environment variable (optional)

[root@red-hat-9 program]# vi /etc/profile

export JDK_HOME=$JAVA_HOME

[root@red-hat-9 program]# source /etc/profile

Step 2: Install Tomcat by unpacking the archive into a directory of your choice

tar zxf apache-tomcat-6.0.18.tar.gz

mv apache-tomcat-6.0.18 /zkl/program/

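Step 3: Using Apache Tomcat

① Start Tomcat with the following command:

# /zkl/program/apache-tomcat-6.0.18/bin/startup.sh

② Tomcat's web root is /zkl/program/apache-tomcat-6.0.18/webapps/. Any web application placed in webapps can be opened in a browser; the default application is the ROOT directory under webapps.

http://127.0.0.1:8080/ serves Tomcat's default application, ROOT

http://127.0.0.1:8080/luceneweb serves luceneweb once it has been placed in webapps

③ The Apache HTTP server listens on port 80, so http://127.0.0.1 serves the Apache document root.

Apache Tomcat listens on port 8080, so the two do not conflict. If there is a conflict, edit Tomcat's configuration file server.xml:

vi /zkl/program/apache-tomcat-6.0.18/conf/server.xml

<Connector port="8080" maxHttpHeaderSize="8192"
      maxThreads="150" minSpareThreads="25" maxSpareThreads="75"
      enableLookups="false" redirectPort="8443" acceptCount="100"
      disableUploadTimeout="true"
      URIEncoding="UTF-8" useBodyEncodingForURI="true" />

The default service port is 8080. If it conflicts with another service (such as Apache), change the port attribute in this file. If Chinese text shows up garbled in Nutch after configuration, add the encoding attributes URIEncoding="UTF-8" and useBodyEncodingForURI="true".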
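If the port does have to change, the edit can be scripted rather than done by hand in vi. The following is a minimal sketch; the function name set_connector_port and the replacement port 8081 are assumptions made for illustration, and the function rewrites every port="OLD" attribute it finds, so the result should be reviewed before Tomcat is restarted.

```shell
# set_connector_port FILE OLD NEW
# Rewrites port="OLD" to port="NEW" in FILE and keeps the original as FILE.bak.
# Hypothetical helper: the match is case-sensitive, so redirectPort="..."
# is left alone.
set_connector_port() {
    file=$1
    old=$2
    new=$3
    cp "$file" "$file.bak" || return 1
    sed "s/port=\"$old\"/port=\"$new\"/" "$file.bak" > "$file"
}

# Usage sketch (8081 is an arbitrary free port, not taken from the article):
#   set_connector_port /zkl/program/apache-tomcat-6.0.18/conf/server.xml 8080 8081
```

Writing from the backup copy into the original avoids relying on sed -i, which older sed versions may not support.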

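3. Configuring and Using Nutch

3.1 Configure Nutch

First download nutch-0.9.tar.gz.

Step 1: Unpack the archive

# tar zxvf nutch-0.9.tar.gz
# mv nutch-0.9 /zkl/ir/nutch-0.9

Step 2: Test Nutch

# /zkl/ir/nutch-0.9/bin/nutch

If output like the following appears, the installation succeeded: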

Usage: nutch COMMAND

where COMMAND is one of:

crawl one-step crawler for intranets

readdb read / dump crawl db

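Step 3: Set up Nutch

① Set the seed URL of the site to crawl

[root@red-hat-9 nutch-0.9]# cd /zkl/ir/nutch-0.9/

[root@red-hat-9 nutch-0.9]# mkdir urls

[root@red-hat-9 nutch-0.9]# vi urls/urls_crawl.txt

Alternatively, skip the directory and create the file urls_crawl.txt directly. That is the approach used here:

[root@red-hat-9 nutch-0.9]# vi urls_crawl.txt

Write the seed URL of the site to crawl into this file. The crawl starts from this URL and follows pages under the same domain, for example:

http://english.gu.cas.cn/ag/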

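② Specify the crawl filter rules

Edit Nutch's URL filter file conf/crawl-urlfilter.txt:

[root@red-hat-9 nutch-0.9]# vi conf/crawl-urlfilter.txt

Find the lines

# accept hosts in MY.DOMAIN.NAME
+^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/

and replace MY.DOMAIN.NAME with the domain of the site you want to crawl. The rule then admits every URL under that site; the starting URL was set in ①.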
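To see what such a rule admits before starting a long crawl, the rule can be built and tried against sample URLs in the shell. This is only a sketch: the domain gu.cas.cn is an assumption derived from the seed URL in ①, the function names are hypothetical, and grep -E stands in for the Java regular expression engine that Nutch uses (the two agree for a simple pattern like this one).

```shell
# domain_rule DOMAIN
# Prints the accept rule for DOMAIN in the form used by crawl-urlfilter.txt.
domain_rule() {
    printf '+^http://([a-z0-9]*\\.)*%s/\n' "$1"
}

# rule_matches RULE URL
# Succeeds if URL matches the regular expression of RULE, i.e. the text
# after the leading + or - sign.
rule_matches() {
    regex=${1#[+-]}
    printf '%s\n' "$2" | grep -Eq -- "$regex"
}

rule=$(domain_rule gu.cas.cn)      # assumed domain, for illustration
echo "$rule"
if rule_matches "$rule" "http://english.gu.cas.cn/ag/"; then
    echo "seed URL is accepted by the rule"
fi
```

As in the stock MY.DOMAIN.NAME line, the dots inside the domain are not escaped, so they match any character; that is slightly looser than a literal dot but harmless here.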

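③ Filtered characters

If URLs on the target site contain filtered characters such as ? and = and you need those pages, relax the character filter. Change

# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]
to
-[*!@]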
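The effect of this change follows from how the filter file is evaluated: rules are tried from top to bottom, the first matching rule decides, and a URL that matches no rule is ignored. The sketch below models that behaviour in a simplified way. The function name filter_decision, the sample rule files and the query URL are assumptions for illustration, and grep -E again stands in for Nutch's Java regex engine.

```shell
# filter_decision RULES_FILE URL
# Prints "accept" or "reject" for URL.
# Lines starting with + accept, lines starting with - reject, the first
# matching rule wins, and a URL that matches no rule is rejected.
filter_decision() {
    while IFS= read -r line; do
        case $line in
            ''|'#'*) continue ;;        # skip blank lines and comments
        esac
        sign=${line%"${line#?}"}        # first character: + or -
        regex=${line#?}                 # remainder of the line
        if printf '%s\n' "$2" | grep -Eq -- "$regex"; then
            if [ "$sign" = "+" ]; then echo accept; else echo reject; fi
            return 0
        fi
    done < "$1"
    echo reject
}

# Usage sketch:
#   filter_decision conf/crawl-urlfilter.txt 'http://english.gu.cas.cn/ag/?id=1'
```

With -[?*!@=] ahead of the accept rule, a URL containing ? or = is rejected before the domain rule is ever reached; after the change to -[*!@] the same URL falls through to the accept rule.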

④ Edit conf/nutch-site.xml

Change it to:

<configuration>
      <property>
            <!-- the http.agent.name property -->
            <name>http.agent.name</name>
            <!-- name under which the crawler identifies itself to the sites it fetches; choose your own -->
            <value>gucas.ac.cn</value>
      </property>
      <property>
            <name>http.agent.version</name>
            <value>1.0</value>
      </property>
      <property>
            <name>searcher.dir</name>
            <value>/zkl/ir/nutch-0.9/gucas</value>
            <description>Path to root of crawl</description>
      </property>
</configuration>

If this agent name is not configured, the crawl fails with the error "Agent name not configured!".
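Because a missing agent name only shows up once the crawl has started, it can be worth checking the file beforehand. The following sketch extracts the configured value; agent_name is a hypothetical helper, and it is a line-oriented approximation that assumes <name> and <value> sit on separate lines, as in the listing above. It is not a general XML parser.

```shell
# agent_name FILE
# Prints the <value> that follows <name>http.agent.name</name> in FILE,
# or nothing if the property is absent.
agent_name() {
    awk '
        /<name>http\.agent\.name<\/name>/ { found = 1; next }
        found && /<value>/ {
            sub(/.*<value>[ \t]*/, "")
            sub(/[ \t]*<\/value>.*/, "")
            print
            exit
        }
    ' "$1"
}

# Usage sketch (run from the Nutch home directory):
#   name=$(agent_name conf/nutch-site.xml)
#   [ -n "$name" ] || echo "http.agent.name is empty" >&2
```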

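⑤ Start the crawl

Run the crawl command to fetch the site:

[root@red-hat-9 nutch-0.9]# bin/nutch crawl urls_crawl.txt -dir gucas -depth 50 -threads 5 -topN 1000 >& logs/logs_crawl.log

· -dir dirname: the directory in which the fetched pages are stored.

· -depth depth: the link depth to crawl.

· -delay delay: the delay between accesses to different hosts, in seconds.

· -threads threads: the number of threads to start.

· -topN N: fetch only the top N URLs at each level.

In the command above, urls_crawl.txt is the seed file created earlier (a directory containing such a file, like urls, also works); -dir names the directory that receives the crawl data, here gucas; -depth is the crawl depth counted from the seed URL; -threads sets the number of concurrent threads; -topN limits each level to its top N URLs; and the trailing redirect to logs/logs_crawl.log saves the crawl output in the file logs_crawl.log under the logs directory, so that the run can be analyzed afterwards.

When the command finishes, a gucas directory exists under nutch-0.9 and holds the fetched data and the generated index. In addition, the logs directory under nutch-0.9 contains the file logs_crawl.log, which holds the crawl log.

If gucas already exists before the run, the command fails with the error: gucas already exist. Delete the directory first, or choose another directory for the crawl data.

Once the steps above have completed, the data has been crawled successfully.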
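The two failure modes mentioned here, a missing seed path and an already existing crawl directory, can be caught before Nutch is started. A minimal sketch follows; crawl_preflight is a hypothetical helper and not part of Nutch.

```shell
# crawl_preflight SEEDS CRAWL_DIR
# Returns 0 if the crawl can start: SEEDS exists and CRAWL_DIR does not
# exist yet. Otherwise prints the reason to stderr and returns 1.
crawl_preflight() {
    if [ ! -e "$1" ]; then
        echo "seed path not found: $1" >&2
        return 1
    fi
    if [ -e "$2" ]; then
        echo "crawl directory already exists: $2" >&2
        return 1
    fi
    return 0
}

# Usage sketch (run from the Nutch home directory):
#   mkdir -p logs
#   crawl_preflight urls_crawl.txt gucas &&
#       bin/nutch crawl urls_crawl.txt -dir gucas -depth 50 -threads 5 \
#           -topN 1000 > logs/logs_crawl.log 2>&1
```

The usage sketch writes the redirect as > file 2>&1, which behaves like >& in bash but also works in a plain POSIX shell.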

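Test: bin/nutch org/apache/nutch/searcher/NutchBean the

This searches the index for the keyword "the".

The example above crawls a single site, which does not show the strength of a web spider that collects data from many sites. The following example crawls several sites:

In the Nutch home directory create the file multiurls.txt and list the sites to download:

http://www.pcauto.com.cn/

http://www.xcar.com.cn/

http://auto.sina.com.cn

Edit the filter file crawl-urlfilter.txt to allow any site:

# accept hosts in MY.DOMAIN.NAME

# allow links to all sites
+^

# skip everything else

Run the crawl command:

[root@red-hat-9 nutch-0.9]# bin/nutch crawl multiurls.txt -dir mutilweb -depth 50 -threads 5 -topN 1000 >& logs/logs_crawl.log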

Edit conf/nutch-site.xml

Change it to:

<configuration>
      <property>
            <!-- the http.agent.name property -->
            <name>http.agent.name</name>
            <!-- this value is the name of the web spider -->
            <value>*</value>
      </property>
      <property>
            <name>http.agent.version</name>
            <value>1.0</value>
      </property>
      <property>
            <name>searcher.dir</name>
            <value>/zkl/ir/nutch-0.9/gucas</value>
            <description>Path to root of crawl</description>
      </property>
</configuration>

Test: bin/nutch org/apache/nutch/searcher/NutchBean SUV

This searches the index for the keyword "SUV".

---------------------------------------------------------------------

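⑥ Deploy the web front end
Copy the nutch-0.9.war package from the Nutch home directory into Tomcat's webapps directory:
[root@red-hat-9 nutch-0.9]# cp nutch-0.9.war /zkl/program/apache-tomcat-6.0.18/webapps/
Then open http://localhost:8080/nutch-0.9/ in a browser. Tomcat unpacks the war file automatically, and a nutch-0.9 folder appears in Tomcat's webapps directory.

⑦ Edit the Nutch web configuration inside Tomcat
vi /zkl/program/apache-tomcat-6.0.18/webapps/nutch-0.9/WEB-INF/classes/nutch-site.xml

Set the searcher.dir property to the directory in which the index was generated.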

<configuration>

<property>

<name>searcher.dir</name>

<value>/zkl/ir/nutch-0.9/gucas</value>

<description>
Path to root of crawl. This directory is searched (in
order) for either the file search-servers.txt, containing a list of
distributed search servers, or the directory "index" containing
merged indexes, or the directory "segments" containing segment
indexes.
</description>

</property>

</configuration>

3.2 Using Nutch (searches return no results; still unresolved)

Restart Tomcat,

then open http://localhost:8080/nutch-0.9/

Common errors:

① Entering a keyword and clicking search produces an error

HTTP Status 500 -

--------------------------------------------------------------------------------

type Exception report

message

description The server encountered an internal error () that prevented it from fulfilling this request.

exception

org.apache.jasper.JasperException: /search.jsp(151,22) Attribute value  language + "/include/header.html" is quoted with " which must be escaped when used within the value

 org.apache.jasper.compiler.DefaultErrorHandler.jspError(DefaultErrorHandler.java:40)

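This happens because of the JSP 2.0 syntax rules: an attribute value quoted with " must not contain unescaped " characters.

As the message suggests, edit search.jsp in the webapps/nutch-0.9 directory under the Tomcat directory and change "<%=language+"/include/header.html"%>"/>

to

'<%=language+"/include/header.html"%>'/>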
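If several attributes in the page are affected, the same substitution can be applied with sed instead of by hand. The sketch below is an assumption-laden illustration: fix_jsp_quotes is a hypothetical helper, and it only rewrites double-quoted attribute values that consist of a single <%= ... %> expression containing a double quote and no % character inside. The exact spacing in the real search.jsp may differ from the fragment quoted above, so compare the output with the original before replacing it.

```shell
# fix_jsp_quotes FILE
# Prints FILE with "<%= ... %>" attribute values that contain inner double
# quotes re-quoted with single quotes. FILE itself is not modified.
fix_jsp_quotes() {
    sed "s|\"\\(<%=[^%]*\"[^%]*%>\\)\"|'\\1'|g" "$1"
}

# Usage sketch (paths relative to webapps/nutch-0.9):
#   cp search.jsp search.jsp.orig
#   fix_jsp_quotes search.jsp.orig > search.jsp
```

Expressions without inner double quotes are left untouched, because the pattern requires at least one " between <%= and %>.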

② The following error appears during the crawl

[root@red-hat-9 nutch-0.9]# bin/nutch crawl urls -dir gucas -depth 50 –threads 5 >& logs/logs1.log

[root@red-hat-9 nutch-0.9]# cat logs/logs1.log

crawl started in: gucas

rootUrlDir = 5

threads = 10

depth = 50

Injector: starting

Injector: crawlDb: gucas/crawldb

Injector: urlDir: 5

Injector: Converting injected urls to crawl db entries.

Exception in thread "main" org.apache.hadoop.mapred.InvalidInputException: Input path doesnt exist : /zkl/ir/nutch-0.9/5

at org.apache.hadoop.mapred.InputFormatBase.validateInput(InputFormatBase.java:138)

at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:326)

at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:543)

at org.apache.nutch.crawl.Injector.inject(Injector.java:162)

at org.apache.nutch.crawl.Crawl.main(Crawl.java:115)

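The same exception is described for the QuickStart grep example:

When you try the grep example in the QuickStart, you get an error like the following:

org.apache.hadoop.mapred.InvalidInputException:Input path doesnt exist : /user/ross/input

You haven't created an input directory containing one or more text files.

bin/hadoop dfs -put conf input

In this case, however, the cause is the dash in front of threads 5: the input method produced an en dash (–) instead of an ASCII hyphen (-). Nutch therefore did not recognize the option and took the trailing 5 as the URL directory, which is why the log shows rootUrlDir = 5. Retype the option as -threads 5.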
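Typographic dashes are hard to spot by eye, so a small check on the arguments can help when a command has been pasted from a web page or typed with an input method active. The following is a sketch with hypothetical helper names; it flags any argument that contains bytes outside printable ASCII.

```shell
# has_non_ascii STRING
# Succeeds if STRING contains any byte outside printable ASCII
# (space through tilde), such as the UTF-8 bytes of an en dash.
has_non_ascii() {
    printf '%s\n' "$1" | LC_ALL=C grep -q '[^ -~]'
}

# check_args ARG...
# Reports each suspicious argument on stderr; fails if any was found.
check_args() {
    bad=0
    for arg in "$@"; do
        if has_non_ascii "$arg"; then
            echo "non-ASCII characters in argument: $arg" >&2
            bad=1
        fi
    done
    return "$bad"
}

# Usage sketch:
#   check_args urls -dir gucas -depth 50 -threads 5 && echo "arguments look fine"
```

Running grep under LC_ALL=C makes it compare single bytes, so the check does not depend on the terminal's locale.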
