Nutch-0.9 Study: Whole-web Crawling (Part 1)


### [b]Whole-web: Bootstrapping the Web Database[/b]
wget http://www.alliedquotes.com/mirrors/apache/lucene/nutch/nutch-0.9.tar.gz
## unpack the tarball
tar xzvf nutch-0.9.tar.gz
mv nutch-0.9 nutch
cd nutch
## download a URL list file
wget http://rdf.dmoz.org/rdf/content.rdf.u8.gz
## decompress the file
gunzip content.rdf.u8.gz
## create a dmoz directory to hold the URL list
mkdir dmoz
## parse the DMOZ dump, keeping roughly one out of every 5000 URLs
bin/nutch org.apache.nutch.tools.DmozParser content.rdf.u8 -subset 5000 > dmoz/urls
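Optionally, before injecting, you can sanity-check the parsed subset; how many URLs you get depends on the DMOZ dump and the random subset:
## optional: see how many URLs were extracted and what they look like
wc -l dmoz/urls
head -n 5 dmoz/urls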
## inject the URLs into the crawl database
bin/nutch inject crawl/crawldb dmoz
## edit nutch-site.xml and add the following properties
vi conf/nutch-site.xml



<property>
<name>http.agent.name</name>
<value>*</value>
<description>HTTP 'User-Agent' request header. MUST NOT be empty -
please set this to a single word uniquely related to your organization.

NOTE: You should also check other related properties:

http.robots.agents
http.agent.description
http.agent.url
http.agent.email
http.agent.version

and set their values appropriately.

</description>
</property>

<property>
<name>http.agent.description</name>
<value>test</value>
<description>Further description of our bot- this text is used in
the User-Agent header. It appears in parenthesis after the agent name.
</description>
</property>

<property>
<name>http.agent.url</name>
<value>test</value>
<description>A URL to advertise in the User-Agent header. This will
appear in parenthesis after the agent name. Custom dictates that this
should be a URL of a page explaining the purpose and behavior of this
crawler.
</description>
</property>

<property>
<name>http.agent.email</name>
<value>test.com</value>
<description>An email address to advertise in the HTTP 'From' request
header and User-Agent header. A good practice is to mangle this
address (e.g. 'info at example dot com') to avoid spamming.
</description>
</property>
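Note that nutch-site.xml has to be a complete configuration file: the <property> blocks above go inside a <configuration> root element, roughly like this skeleton:

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<!-- put the <property> elements shown above here -->
</configuration>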




#### [b]Whole-web: Fetching[/b]
## first round: generate a fetch list, fetch it, and update the crawldb
bin/nutch generate crawl/crawldb crawl/segments
s1=`ls -d crawl/segments/2* | tail -1`
echo $s1
bin/nutch fetch $s1
bin/nutch updatedb crawl/crawldb $s1

## second round: generate a new fetch list before picking the latest segment
bin/nutch generate crawl/crawldb crawl/segments
s2=`ls -d crawl/segments/2* | tail -1`
echo $s2

bin/nutch fetch $s2
bin/nutch updatedb crawl/crawldb $s2

## third round
bin/nutch generate crawl/crawldb crawl/segments
s3=`ls -d crawl/segments/2* | tail -1`
echo $s3

bin/nutch fetch $s3
bin/nutch updatedb crawl/crawldb $s3
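
The three rounds above are the same generate/fetch/updatedb cycle repeated, so they can also be run as a loop; a minimal sketch (three rounds assumed, add -topN to generate if you want to cap each fetch list):

## repeat the generate -> fetch -> updatedb cycle
for i in 1 2 3; do
  bin/nutch generate crawl/crawldb crawl/segments
  seg=`ls -d crawl/segments/2* | tail -1`
  echo "fetching $seg"
  bin/nutch fetch $seg
  bin/nutch updatedb crawl/crawldb $seg
done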

#### [b]Whole-web: Indexing[/b]

bin/nutch invertlinks crawl/linkdb crawl/segments/*

bin/nutch index crawl/indexes crawl/crawldb crawl/linkdb crawl/segments/*
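
Optionally, you can check what ended up in the crawldb with Nutch's readdb tool before moving on to the web UI:

## optional: show crawldb statistics (total URLs, fetch status counts)
bin/nutch readdb crawl/crawldb -stats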

### Install Tomcat 6
wget http://apache.imghat.com/tomcat/tomcat-6/v6.0.20/bin/apache-tomcat-6.0.20.tar.gz

tar xzvf apache-tomcat-6.0.20.tar.gz

mv apache-tomcat-6.0.20 /usr/share/tomcat6

#### Searching

## test the index from the command line by searching for "apache"
bin/nutch org.apache.nutch.searcher.NutchBean apache

## deploy the nutch web application to tomcat
cp nutch*.war /usr/share/tomcat6/webapps/nutch.war

## start Tomcat
/usr/share/tomcat6/bin/catalina.sh start
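
Assuming Tomcat is still listening on its default port 8080, a quick way to confirm the webapp deployed is:

## quick check that the nutch webapp is reachable (default port 8080 assumed)
curl -I http://localhost:8080/nutch/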





[b][color=darkred]Note:[/color][/b]
For Nutch-0.9 you must use Tomcat 6. I had installed Tomcat 5 with yum: searching from the command line returned results, but the Tomcat web page kept showing no results. After switching to Tomcat 6 everything worked fine.

[b][color=darkred]A few more points to note:[/color][/b]

First: you need to configure /usr/share/tomcat6/webapps/nutch/WEB-INF/classes/nutch-site.xml


<!-- HTTP properties -->
<property>
<name>http.agent.name</name>
<value>*</value>
<description></description>
</property>
<!-- file properties -->
<property>
<name>searcher.dir</name>
<value>/root/nutch/crawl</value>
<description></description>
</property>
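
After changing searcher.dir, restart Tomcat so the webapp picks up the new configuration, for example:

## restart tomcat to pick up the new searcher.dir
/usr/share/tomcat6/bin/catalina.sh stop
/usr/share/tomcat6/bin/catalina.sh start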


Second:
org.apache.jasper.JasperException: /search.jsp(151,22) Attribute value  language + "/include/header.html" is quoted with " which must be escaped when used within the value
// to fix this error, edit search.jsp, cached.jsp, explain.jsp, and anchors.jsp


<jsp:include page="<%= language + "/include/header.html"%>"/>
replace with
<jsp:include page="<%= language + \"/include/header.html\"%>"/>


<i18n:message key="page">
<i18n:messageArg value="<%=details.getValue("url")%>"/>
</i18n:message>

replace with

<i18n:message key="page">
<% String detailsStr=details.getValue("url");%>
<i18n:messageArg value="<%=detailsStr%>"/>
</i18n:message>
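
If you prefer not to edit each JSP by hand, the quoting fix for the <jsp:include> lines can also be applied with sed; this is only a sketch that assumes the include lines look exactly like the one shown above, and the messageArg replacement still has to be made manually:

## escape the inner quotes in the <jsp:include> lines of the affected JSPs
cd /usr/share/tomcat6/webapps/nutch
for f in search.jsp cached.jsp explain.jsp anchors.jsp; do
  sed -i 's#+ "/include/\([a-z]*\.html\)"%>#+ \\"/include/\1\\"%>#g' $f
done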



Finally, most of the steps in this article come from the 0.8 tutorial at http://lucene.apache.org/nutch/tutorial8.html, with a few small changes for 0.9.

Result:

[img]/upload/attachment/136993/3ad658b5-d835-3501-9acf-3b4a87d3b505.png[/img]


Update: nutch-1.0 has already updated these JSPs. Just make sure nutch-default.xml and nutch-site.xml are configured correctly.