Nutch Study: Basic Usage Notes


CarsonNiu


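Copyright notice: this is an original article by the author, released under the CC 4.0 BY-SA license. When reposting, please include a link to the original source and this notice.

Article link: https://blog.csdn.net/nxh_love/article/details/6609394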


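I spent almost half a month studying Nutch. Below is a summary of the process, the errors I ran into, and how I fixed them:

1. Preparation (software), on Windows:

The nutch-1.2 open-source package (I have not yet worked out how to use the latest release, 1.3)

Cygwin: emulates a Linux environment on Windows

Tomcat 6.0 server: I downloaded the non-installer (zip) version directly from the official site

JDK 1.6 or later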

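2. Steps:

JDK installation: install it and configure the environment variables

Tomcat 6.0:

I put it in C:\tomcat-6.0. Start the server with C:\tomcat-6.0\bin\startup.bat;

stop it with C:\tomcat-6.0\bin\shutdown.bat

Cygwin installation (the most important part):

Default installation path: C:\cygwin

Environment variables: append ;C:\cygwin\bin to Path (the ; separator is required), and add another variable named CYGWIN with the value nodosfilewarning (otherwise errors appear)

Testing the installation: open Cygwin, change to the nutch-1.2 directory (for convenience I unpacked nutch-1.2 under C:\cygwin\home\nxh), and run bin/nutch. If the help text is printed, the installation works.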

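1. In the Nutch installation directory, create a directory named urls and add a text file to it containing the URLs of the sites you want to crawl. My example address is http://www.sina.com.cn/ (include the http:// prefix, since the filter rule in step 2 only accepts URLs that start with it).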
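As a rough sketch, the seed directory from step 1 can be created from the Cygwin prompt like this. The file name seed.txt is my own choice for illustration; the step only says to add a text file to urls.

```shell
# Run from the Nutch installation directory (e.g. ~/nutch-1.2 in Cygwin).
# seed.txt is a hypothetical file name chosen for this sketch.
mkdir -p urls
printf '%s\n' 'http://www.sina.com.cn/' > urls/seed.txt

# Show what was written, one URL per line.
cat urls/seed.txt
```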

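2. Edit conf/crawl-urlfilter.txt. In the filter rules, a leading "+" means the matching URLs are allowed to be fetched. The default rule is:

+^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/

It can be changed to +^http://([a-z0-9]*\.)*sina.com.cn/ (the unescaped dots match any character; writing sina\.com\.cn is stricter).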
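To see which URLs the accept rule lets through, it can be tried out with grep before crawling. This is only an approximation: Nutch evaluates the rule as a Java regular expression, and the real crawl-urlfilter.txt contains other rules (evaluated in order) that this sketch ignores. The function name url_allowed and the sample URLs are made up for the example.

```shell
# The accept rule from crawl-urlfilter.txt, without its leading "+".
rule='^http://([a-z0-9]*\.)*sina.com.cn/'

url_allowed() {
  # Returns 0 (success) when the URL matches the rule, non-zero otherwise.
  printf '%s\n' "$1" | grep -Eq "$rule"
}

for url in 'http://www.sina.com.cn/' \
           'http://news.sina.com.cn/china/' \
           'http://www.example.com/' \
           'www.sina.com.cn'; do
  if url_allowed "$url"; then
    echo "accept $url"
  else
    echo "reject $url"
  fi
done
```

The last sample shows why the seed URL needs its http:// prefix: without it the rule does not match.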

3. Edit conf/nutch-site.xml. The original file contains only <configuration></configuration>; add the following between these two tags:

<property>

<name>http.agent.name</name>

<value>myfirsttest</value>

<description>HTTP 'User-Agent' request header. MUST NOT be empty -

please set this to a single word uniquely related to your organization.

NOTE: You should also check other related properties:

http.robots.agents

http.agent.description

http.agent.url

http.agent.email

http.agent.version

and set their values appropriately.

</description>

</property>

<property>

<name>http.agent.description</name>

<value>myfirsttest</value>

<description>Further description of our bot- this text is used in

the User-Agent header. It appears in parenthesis after the agent name.

</description>

</property>

<property>

<name>http.agent.url</name>

<value>myfirsttest.com</value>

<description>A URL to advertise in the User-Agent header. This will

appear in parenthesis after the agent name. Custom dictates that this

should be a URL of a page explaining the purpose and behavior of this

crawler.

</description>

</property>

<property>

<name>http.agent.email</name>

<value>test@test.com</value>

<description>An email address to advertise in the HTTP 'From' request

header and User-Agent header. A good practice is to mangle this

address (e.g. 'info at example dot com') to avoid spamming.

</description>

</property>

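4. Run the crawl command. A typical invocation:

bin/nutch crawl urls -dir sohu -depth 3 -topN 100 -threads 3 >& sohu.log

-dir: the directory where the crawl results are stored (sohu is just the name I used)

-depth: the link depth to crawl

-topN: fetch at most the top N URLs at each level

-threads: the number of fetcher threads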
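Once the crawl has finished, a quick sanity check is to look at what it left in the -dir directory. The sketch below assumes the one-step crawl command in Nutch 1.2 writes the sub-directories crawldb, linkdb, segments, indexes and index. That list is my recollection rather than something stated in this article, so adjust it if your output differs. The function name check_crawl_dir is hypothetical.

```shell
# Report which expected sub-directories exist under a crawl output directory.
# ASSUMPTION: the list of names below is what Nutch 1.2 "crawl" produces.
check_crawl_dir() {
  dir=$1
  missing=0
  for sub in crawldb linkdb segments indexes index; do
    if [ -d "$dir/$sub" ]; then
      echo "ok      $dir/$sub"
    else
      echo "missing $dir/$sub"
      missing=1
    fi
  done
  # Exit status 0 only when every sub-directory was found.
  return "$missing"
}

# Usage, after "bin/nutch crawl urls -dir sohu ..." has finished:
#   check_crawl_dir sohu && echo "crawl output looks complete"
```

If something is missing, sohu.log is the first place to look, since the command above redirects both standard output and standard error into it.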

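3. Search

Deploy nutch-1.2.war from the Nutch directory into Tomcat's webapps directory and start Tomcat. In the unpacked nutch-1.2 webapp directory, find the nutch-site.xml file (in Nutch 1.2 it normally sits under WEB-INF\classes) and change its contents to: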

<configuration>

<property>

<name>http.agent.name</name>

<value>myfirsttest</value>

</property>

<property>

<name>searcher.dir</name>

<value>C:/cygwin/home/nxh/nutch-1.2/sohu</value>

</property>

</configuration>

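Restart Tomcat and open http://localhost:8080/nutch-1.2 in a browser; the search page should appear. Two encoding problems came up:

1. Searching for Chinese text produces garbled characters. This is not a Nutch problem. Edit Tomcat's configuration file tomcat6\conf\server.xml and add the two attributes URIEncoding and useBodyEncodingForURI to the Connector: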

<Connector port="8080" protocol="HTTP/1.1"

connectionTimeout="20000"

redirectPort="8443"

URIEncoding="UTF-8"

useBodyEncodingForURI="true"/>

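2. Cached pages (snapshots) are garbled. Edit webapps\nutch-1.2\cached.jsp and change content = new String(bean.getContent(details)) to content = new String(bean.getContent(details),"utf-8"). This decodes the stored bytes as UTF-8 instead of the platform default charset, so it helps for pages that are actually UTF-8 encoded.

Reference articles for further modifications:

Adding the IKAnalyzer Chinese tokenizer to Nutch 1.2: http://blog.csdn.net/jiutao_tang/article/details/6461884

Things that need changing for custom Nutch development: http://blog.csdn.net/jiutao_tang/article/details/6524346

Summary of custom Nutch development (part 2): http://hi.baidu.com/jessicakey/blog/item/3423ad4924956ee382025c98.html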
