Nutch 0.9 Installation Guide (Windows)

chris1081

Article link: https://blog.csdn.net/chris1081/article/details/2840895

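I. Install the JDK

1. Download: http://java.sun.com/javase/downloads/index.jsp

2. Install directory: choose any location; this guide assumes d:/Program Files/Java/jdk/

3. Set the Path environment variable: right-click My Computer > Properties > Advanced > Environment Variables, then under System variables edit Path and append the directory d:/Program Files/Java/jdk/bin

4. Set the JAVA_HOME environment variable: in the same dialog, create a new variable named JAVA_HOME with the value d:/Program Files/Java/jdk/

5. Run java -version in cmd. If the version it reports matches the JDK you installed, the installation succeeded.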
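
Once a POSIX shell is available (for example the Cygwin shell installed in section III), the JAVA_HOME setting can also be sanity-checked from a script. The following is only a minimal sketch: check_jdk is a hypothetical helper name, and it assumes the shell can resolve the path stored in JAVA_HOME.

```shell
#!/bin/sh
# Sketch: report whether a directory looks like a usable JDK.
# check_jdk is a hypothetical helper, not part of the JDK or Nutch.
check_jdk() {
    jdk_dir=$1
    if [ -z "$jdk_dir" ]; then
        echo "JAVA_HOME is not set"
        return 1
    fi
    # Accept both the Unix and the Windows name of the launcher.
    if [ -f "$jdk_dir/bin/java" ] || [ -f "$jdk_dir/bin/java.exe" ]; then
        echo "JDK found in $jdk_dir"
        return 0
    fi
    echo "no java launcher under $jdk_dir/bin"
    return 1
}
```

A typical call would be check_jdk "$JAVA_HOME" followed by java -version; the quotes matter because the assumed install path contains a space.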

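II. Install Tomcat

1. Download: http://tomcat.apache.org/

2. Install directory: choose any location; this guide assumes d:/Program Files/tomcat

3. Set the TOMCAT_HOME environment variable: as above, create a new variable named TOMCAT_HOME with the value d:/Program Files/tomcat

4. Edit server.xml under tomcat/conf. Find the node containing Connector port="8080" and add two attributes to it: URIEncoding="UTF-8" useBodyEncodingForURI="true". This switches Tomcat's URI encoding to UTF-8, which is what Nutch uses; without it, Chinese text comes out garbled.

5. Start Tomcat: in d:/Program Files/tomcat/bin, double-click startup.bat

6. Open http://127.0.0.1:8080/ in a browser. If the welcome page appears, the installation succeeded.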
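
Whether the two encoding attributes from step 4 actually made it into server.xml can be checked with a quick text search. This is a rough sketch: has_utf8_connector is a hypothetical helper name, and it only looks for the attribute strings anywhere in the file rather than parsing the XML, so it cannot tell which Connector they belong to.

```shell
#!/bin/sh
# Sketch: succeed only if the given server.xml mentions both attributes.
# has_utf8_connector is a hypothetical helper; this is a plain text match.
has_utf8_connector() {
    grep -q 'URIEncoding="UTF-8"' "$1" &&
        grep -q 'useBodyEncodingForURI="true"' "$1"
}
```

For the layout assumed in this guide, the call would be has_utf8_connector "d:/Program Files/tomcat/conf/server.xml" && echo ok.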

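III. Install Cygwin

Note: you need at least 2 GB of free disk space.

1. Download: http://www.cygwin.com/setup.exe

2. Install directory: choose any location; this guide assumes d:/Program Files/cygwin. The most basic installation is enough.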

IV. Install Nutch

1. Download: http://apache.mirror.phpchina.com/lucene/nutch/nutch-0.9.tar.gz

2. Unpack it to the directory d:/Program Files/nutch

3. In d:/Program Files/nutch, create a file named urls.txt and list the sites you want to crawl in it, for example http://www.wokenet.com/bbs

4. Open nutch/conf/crawl-urlfilter.txt and replace the string MY.DOMAIN.NAME with the domain of the URL in urls.txt. A simpler option is to delete MY.DOMAIN.NAME so that only +^http://([a-z0-9]*\.)* remains, which allows every http site to be crawled.

5. Open nutch/conf/nutch-site.xml and insert the following inside <configuration></configuration>:

 
<property>
    <name>http.agent.name</name>
    <value>nutchcvs</value>
    <description>HTTP 'User-Agent' request header. MUST NOT be empty -
        please set this to a single word uniquely related to your organization.
        NOTE: You should also check other related properties:
        http.robots.agents
        http.agent.description
        http.agent.url
        http.agent.email
        http.agent.version
        and set their values appropriately.
    </description>
</property>
<property>
    <name>http.agent.description</name>
    <value></value>
    <description>Further description of our bot- this text is used in
        the User-Agent header. It appears in parenthesis after the agent name.
    </description>
</property>
<property>
    <name>http.agent.url</name>
    <value></value>
    <description>A URL to advertise in the User-Agent header. This will
        appear in parenthesis after the agent name. Custom dictates that this
        should be a URL of a page explaining the purpose and behavior of this
        crawler.
    </description>
</property>
<property>
    <name>http.agent.email</name>
    <value></value>
    <description>An email address to advertise in the HTTP 'From' request
        header and User-Agent header. A good practice is to mangle this
        address (e.g. 'info at example dot com') to avoid spamming.
    </description>
</property>
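
The URL filter rule from step 4 can be tried out before starting a crawl. The sketch below is an approximation under stated assumptions: it uses grep -E rather than the Java regex engine Nutch itself uses, drops the leading + (which only marks the rule as an accept rule), and url_allowed, ALLOW_ANY and ALLOW_ONE are hypothetical names. The wokenet.com pattern is just the example seed from step 3 with its dots escaped.

```shell
#!/bin/sh
# The filter pattern with MY.DOMAIN.NAME removed: matches any http:// URL.
ALLOW_ANY='^http://([a-z0-9]*\.)*'
# The same pattern restricted to the example seed domain from step 3.
ALLOW_ONE='^http://([a-z0-9]*\.)*wokenet\.com/'

# url_allowed PATTERN URL: exit status 0 if URL matches PATTERN.
# Hypothetical helper; grep -E only approximates Nutch's Java regex matching.
url_allowed() {
    printf '%s\n' "$2" | grep -Eq "$1"
}
```

For example, url_allowed "$ALLOW_ONE" "http://www.wokenet.com/bbs" succeeds, while the same pattern rejects a URL on any other domain. Restricting the rule to one domain keeps the crawl from wandering off the seed site.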
 

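V. Crawl with Nutch

Because Nutch was configured above for a single site, we also run it as a single-site crawl; whole-web crawling will be covered in a later article.

First, look at the command Nutch provides: nutch crawl urls.txt -dir crawl -depth 3 -topN 50

    crawl: tells Nutch to run the main method of its crawl class.
    urls.txt: the seed list holding the URLs to crawl. The name must match what you actually created; if your seed file or directory is called search, write search here instead.
    -dir crawl: where the crawl output is saved. A relative name is resolved against the directory the command is run from, so here it ends up as d:/Program Files/nutch/crawl.
    -depth 3: the number of crawl rounds, that is, the link depth counted from the seed pages. For testing, 1 is recommended.
    -topN 50: the maximum number of pages fetched in each round.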
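
As a rough guide to how large a crawl can get: since -topN caps the pages fetched in each round and -depth sets the number of rounds, their product is an upper bound on the pages fetched by one run. A tiny sketch of that arithmetic (max_fetched is a hypothetical helper name):

```shell
#!/bin/sh
# max_fetched DEPTH TOPN: upper bound on pages fetched by one crawl run,
# assuming at most TOPN pages are fetched in each of DEPTH rounds.
max_fetched() {
    echo $(( $1 * $2 ))
}

max_fetched 3 50   # prints 150
```

The real number is usually lower, because a round fetches fewer pages when fewer unfetched URLs pass the URL filter.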

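1. Open the Cygwin shortcut, cd into the Nutch install directory d:/Program Files/nutch (quote the path, because it contains a space), and run: sh ./bin/nutch crawl urls.txt -dir crawl -depth 3 -topN 50

2. Wait for the crawl to finish.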
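
After the crawl finishes, it can be worth confirming that the output directory is complete before pointing the search application at it. The sketch below assumes the layout a Nutch 0.9 crawl normally produces (crawldb, linkdb, segments, indexes and index under the -dir directory); missing_crawl_parts is a hypothetical helper name.

```shell
#!/bin/sh
# missing_crawl_parts DIR: print each expected subdirectory that is absent.
# The list of names is an assumption about the Nutch 0.9 crawl layout.
missing_crawl_parts() {
    for part in crawldb linkdb segments indexes index; do
        [ -d "$1/$part" ] || echo "$part"
    done
}
```

Run from the Nutch directory as missing_crawl_parts crawl; no output means every expected part is present, while a missing index usually means the crawl stopped before the indexing step.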

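VI. Run a search

1. In the Nutch install directory, find nutch-0.9.war, rename it to nutch.war, and unpack it under tomcat/webapps so that the application ends up in webapps/nutch.

2. Make sure webapps/nutch/WEB-INF/classes/nutch-site.xml contains the following property inside <configuration></configuration>: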

 
<property>
    <name>searcher.dir</name>
    <value>d:/Program Files/nutch/crawl</value>
</property>
 

3. Start Tomcat and open http://127.0.0.1:8080/nutch in a browser. Congratulations, the installation is complete!
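
If the search page loads but finds nothing, one thing worth double-checking is the searcher.dir value from step 2. The sketch below pulls that value out of a nutch-site.xml with awk; searcher_dir is a hypothetical helper name, and it assumes the value tag follows the name tag, as in the snippet above, rather than doing real XML parsing.

```shell
#!/bin/sh
# searcher_dir FILE: print the value configured for searcher.dir.
# Hypothetical helper; relies on <value> appearing on or after the line
# that holds <name>searcher.dir</name>.
searcher_dir() {
    awk '
        /<name>searcher\.dir<\/name>/ { found = 1 }
        found && /<value>/ {
            sub(/.*<value>/, "")      # drop everything up to the opening tag
            sub(/<\/value>.*/, "")    # drop the closing tag and what follows
            print
            exit
        }
    ' "$1"
}
```

The printed path should be the directory passed to -dir during the crawl, which in this guide is d:/Program Files/nutch/crawl.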

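You can take a look at an example I built with Nutch: Wokenet (蜗壳网), a search engine for laptops, at http://www.wokenet.com/

If you have any questions, leave me a comment. Thanks for reading.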
