Nutch 0.9 Installation Guide (for Windows)

Latest recommended article published 2017-05-05 18:40:33

tjsearchengine

Article link: https://blog.csdn.net/tjsearchengine/article/details/1796858

If earlier versions of the JDK and Tomcat are already installed on the machine, uninstall them first and then reboot.

1. Install the JDK

The JDK version is 1.6. Download it from:
http://www.sun.com/download/
1) Install path: d:/Program Files/Java/jdk1.6.0_02/

2) Add d:/Program Files/Java/jdk1.6.0_02/bin; to the PATH environment variable.
3) Set the JAVA_HOME environment variable to d:/Program Files/Java/jdk1.6.0_02

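2. Install Tomcat

The version is 6.0. Download it from:
http://tomcat.apache.org/
1) Install path: D:/Program Files/Tomcat6

2) Set the TOMCAT_HOME environment variable to D:/Program Files/Tomcat6

3) Copy JAVA_HOME/lib/tools.jar into TOMCAT_HOME/lib, then restart Tomcat (run startup.bat in the bin directory).
(Alternatively, in the Tomcat console under the Java options, set Java classpath to:
%tomcat_home%/bin/bootstrap.jar;%java_home%/lib/tools.jar
Note: %java_home% and %tomcat_home% are the installation root directories of the JDK and Tomcat respectively.)

4) Open port 8080 in the server's firewall settings.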
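
The copy in step 3) can also be done from a POSIX-style shell, for example the Cygwin shell installed in the next section. The following is only a sketch: the function name copy_tools_jar is made up for illustration, and it assumes the two arguments are the JDK and Tomcat install roots chosen above.

```shell
#!/bin/sh
# Hypothetical helper (not part of the JDK or Tomcat): copy tools.jar from
# the JDK's lib directory into Tomcat's lib directory, as in step 3).
copy_tools_jar() {
  java_home=$1
  tomcat_home=$2
  if [ ! -f "$java_home/lib/tools.jar" ]; then
    echo "tools.jar not found under $java_home/lib" >&2
    return 1
  fi
  if [ ! -d "$tomcat_home/lib" ]; then
    echo "no lib directory under $tomcat_home" >&2
    return 1
  fi
  cp "$java_home/lib/tools.jar" "$tomcat_home/lib/"
}

# Example call, assuming both environment variables are set as described:
#   copy_tools_jar "$JAVA_HOME" "$TOMCAT_HOME"
```

Tomcat still has to be restarted afterwards so that it picks up the copied jar.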

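3. Install Cygwin

Download Cygwin and run Cygwin/cyg_win_setup.exe to install it.

(Have at least 2 GB of free disk space.)

Install path: D:/Cygwin

Choose "Install from Local Directory".

Install only the basic packages (leave out Graphics, Games and X11).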
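
Nutch is started through a shell script (bin/nutch) that relies on JAVA_HOME, so before moving on it may help to confirm from the Cygwin shell that the variable set in step 1 is visible there. This is a minimal sketch; looks_like_jdk is a hypothetical helper name, and the check (a java or java.exe file under bin) is an assumption about what a usable JDK root looks like.

```shell
#!/bin/sh
# Hypothetical helper: succeed if the given directory looks like a JDK root,
# i.e. it contains bin/java (or bin/java.exe on Windows).
looks_like_jdk() {
  [ -n "$1" ] && { [ -e "$1/bin/java" ] || [ -e "$1/bin/java.exe" ]; }
}

# Report on the current JAVA_HOME without failing when it is unset.
if looks_like_jdk "${JAVA_HOME:-}"; then
  echo "JAVA_HOME looks usable: ${JAVA_HOME}"
else
  echo "JAVA_HOME is unset or has no bin/java: '${JAVA_HOME:-}'"
fi
```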

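4. Install Nutch

1) Download the Nutch package from http://lucene.apache.org/nutch/ (about 60 MB).

2) Put nutch-0.9.tar.gz in the root of the Cygwin installation directory (for example D:/cygwin).

Open the Cygwin shortcut, change to the root directory, and run dir; nutch-0.9.tar.gz should be listed.

3) Run tar xvf nutch-0.9.tar.gz to unpack it. This creates a nutch-0.9 folder in the root directory.

4) Rename that folder: mv nutch-0.9 nutch

5) Under nutch/bin, create a urls directory and inside it a url.txt file. Write one URL you want to crawl into url.txt, for example: http://www.163.com
6) Open nutch/conf/crawl-urlfilter.txt and replace the MY.DOMAIN.NAME string with the domain of the URL in url.txt. A simpler option is to delete MY.DOMAIN.NAME altogether, keeping only +^http://([a-z0-9]*\.)* on that line, which allows every http site to be crawled.

7) Open nutch/conf/nutch-site.xml and insert the following inside <configuration></configuration>: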

<property>
<name>http.agent.name</name>
<value>nutchcvs</value>
<description>HTTP 'User-Agent' request header. MUST NOT be empty -
please set this to a single word uniquely related to your organization.

NOTE: You should also check other related properties:

http.robots.agents
http.agent.description
http.agent.url
http.agent.email
http.agent.version

and set their values appropriately.
</description>
</property>

<property>
<name>http.agent.description</name>
<value></value>
<description>Further description of our bot- this text is used in
the User-Agent header. It appears in parenthesis after the agent name.
</description>
</property>

<property>
<name>http.agent.url</name>
<value></value>
<description>A URL to advertise in the User-Agent header. This will
appear in parenthesis after the agent name. Custom dictates that this
should be a URL of a page explaining the purpose and behavior of this
crawler.
</description>
</property>

<property>
<name>http.agent.email</name>
<value></value>
<description>An email address to advertise in the HTTP 'From' request
header and User-Agent header. A good practice is to mangle this
address (e.g. 'info at example dot com') to avoid spamming.
</description>
</property>

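Replace the text inside the <value> elements (for example nutchcvs) with your own values. Leaving them as they are also works, as long as http.agent.name is not empty. These settings exist because Nutch follows the robots protocol: it sends this identifying information with its requests so that the crawled site can recognize the crawler.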
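
Steps 5) and 6) above can be sketched as shell commands. The function names write_seed and open_url_filter are made up for this example, and the demo runs in a scratch directory against a stand-in filter file containing only the placeholder line, so it is safe to try before touching the real nutch folder.

```shell
#!/bin/sh
# Hypothetical helpers for steps 5) and 6); the names are illustrative only.

# Step 5): create bin/urls/url.txt under the Nutch folder with one seed URL.
write_seed() {
  nutch_dir=$1
  seed_url=$2
  mkdir -p "$nutch_dir/bin/urls"
  printf '%s\n' "$seed_url" > "$nutch_dir/bin/urls/url.txt"
}

# Step 6): delete the MY.DOMAIN.NAME placeholder (and the slash after it)
# from the filter file, leaving only +^http://([a-z0-9]*\.)* on that line.
open_url_filter() {
  filter=$1
  sed 's#MY\.DOMAIN\.NAME/*##' "$filter" > "$filter.tmp" && mv "$filter.tmp" "$filter"
}

# Demo in a scratch directory with a stand-in for the placeholder line.
demo=$(mktemp -d)
mkdir -p "$demo/conf"
printf '%s\n' '+^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/' > "$demo/conf/crawl-urlfilter.txt"
write_seed "$demo" "http://www.163.com"
open_url_filter "$demo/conf/crawl-urlfilter.txt"
cat "$demo/bin/urls/url.txt" "$demo/conf/crawl-urlfilter.txt"
```

To apply the same changes for real, the two functions would be called with the actual nutch folder and nutch/conf/crawl-urlfilter.txt instead of the scratch paths.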

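5. Crawling with Nutch
   Because Nutch was configured above for a single site, the crawl is also run in single-site mode. Whole-web crawling will be covered in a later article.
   First, look at the command Nutch provides: nutch crawl urls -dir crawl -depth 3 -topN 50
   crawl: tells Nutch to run the main method of its crawl class.
   urls: the directory that holds the url.txt file with the URLs to crawl. This name must match your folder name; if your folder is called search, write search here instead.
   -dir crawl: where the crawl results are saved; the folder will appear under nutch/bin.
   -depth 3: the number of crawl rounds, also called the depth ("rounds" seems the more fitting word). For testing, 1 is recommended.
   -topN 50: the maximum number of pages fetched in each round.

      Steps to run the command:
      1) Open the Cygwin shell.
      2) Use cd to change into nutch/bin.
      3) Run: sh nutch crawl urls -dir crawl -depth 3 -topN 50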
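
The command from the steps above can be wrapped in a small function that checks its inputs first. This is a sketch, not part of Nutch: run_crawl and the NUTCH_CMD variable are hypothetical names, and the refusal to reuse an existing output directory is an assumption made here to avoid mixing old and new crawl data.

```shell
#!/bin/sh
# Hypothetical wrapper around the crawl command shown above.
# NUTCH_CMD defaults to "sh nutch" (meant to be run from nutch/bin); set it
# to "echo sh nutch" for a dry run that only prints the command line.
run_crawl() {
  url_dir=$1
  out_dir=$2
  depth=${3:-1}    # the text suggests 1 for a first test
  topn=${4:-50}
  if [ ! -s "$url_dir/url.txt" ]; then
    echo "no seed URLs in $url_dir/url.txt" >&2
    return 1
  fi
  if [ -e "$out_dir" ]; then
    echo "$out_dir already exists; remove it or pick another name" >&2
    return 1
  fi
  # Left unquoted on purpose so "sh nutch" splits into command and argument.
  ${NUTCH_CMD:-sh nutch} crawl "$url_dir" -dir "$out_dir" -depth "$depth" -topN "$topn"
}
```

With urls/url.txt in place and no crawl folder yet, setting NUTCH_CMD="echo sh nutch" and calling run_crawl urls crawl 1 50 prints the command line without starting a crawl; leaving NUTCH_CMD unset runs it.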

The detailed crawl log is written under nutch/logs. Look for entries such as "INFO fetcher.Fetcher - fetching http://XXXXXXX"; these record the fetching process.
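
Picking the fetch entries out of the log can be sketched with sed. The helper names list_fetched and count_fetched are made up, the log file path is passed in because the text does not name the file, and the sample lines in the demo are constructed from the pattern quoted above rather than copied from a real crawl.

```shell
#!/bin/sh
# Hypothetical helpers: extract the URLs from log lines of the form
# "... fetcher.Fetcher - fetching <url>" in the given log file.
list_fetched() {
  sed -n 's/.*fetcher\.Fetcher - fetching //p' "$1"
}

# Count how many such lines the log file contains.
count_fetched() {
  list_fetched "$1" | wc -l | tr -d ' '
}

# Demo on a scratch file with made-up lines that follow the quoted pattern.
sample=$(mktemp)
printf '%s\n' \
  'INFO fetcher.Fetcher - fetching http://www.163.com' \
  'some other log line' \
  'INFO fetcher.Fetcher - fetching http://www.163.com/index.html' > "$sample"
list_fetched "$sample"
count_fetched "$sample"
```

For a real crawl, the argument would be whichever file under nutch/logs holds the crawl log.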

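6. Configuring search
1) Nutch provides a web search page similar to Google or Baidu. Find nutch-0.9.war in the unpacked Nutch package and copy it into the tomcat/webapps directory.

2) Rename the nutch-0.9 folder under webapps to nutch.
3) Edit webapps/nutch/WEB-INF/classes/nutch-site.xml so that it contains the following: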

<property>
<name>searcher.dir</name>
<value>D:/cygwin/nutch/bin/crawl</value>
</property>

The <value> content is the location of the crawl directory produced by the crawl above; the search page reads its data from there.
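
If the search page returns nothing, a quick look at the directory named in searcher.dir may help. The sketch below assumes a finished crawl leaves crawldb, linkdb, segments and index subfolders in the output directory; that list is an assumption rather than something stated in this article, and check_crawl_dir is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: report which of the assumed crawl subfolders are
# missing from the given directory; returns non-zero if any is missing.
check_crawl_dir() {
  crawl_dir=$1
  missing=0
  for sub in crawldb linkdb segments index; do
    if [ ! -d "$crawl_dir/$sub" ]; then
      echo "missing: $crawl_dir/$sub"
      missing=1
    fi
  done
  return $missing
}

# Example call from the Cygwin shell, using the path from searcher.dir:
#   check_crawl_dir D:/cygwin/nutch/bin/crawl
```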