Integrating Nutch 1.4 with Solr 4.0
1. Unpack Nutch 1.4 and Solr 4.0 (optional: set the NUTCH_HOME and SOLR_HOME environment variables).
2. Change the last line of $NUTCH_HOME/runtime/local/conf/regex-urlfilter.txt to:
+^http://www.163.com/
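The edit in step 2 can also be scripted. The following is a minimal sketch, not part of the original procedure: FILTER_FILE is a hypothetical variable that defaults to a scratch file so the snippet can be tried safely, and the three seeded lines are only a stand-in for the stock filter file.

```shell
#!/bin/sh
# Sketch: replace the last line of regex-urlfilter.txt with one accept rule.
# FILTER_FILE is a hypothetical variable; for real use point it at
# $NUTCH_HOME/runtime/local/conf/regex-urlfilter.txt.
FILTER_FILE="${FILTER_FILE:-$(mktemp)}"

# Scratch case only: seed an empty file with stand-in rules (assumed content).
if [ ! -s "$FILTER_FILE" ]; then
    printf '%s\n' '# stand-in rules' '-^(file|ftp|mailto):' '+.' > "$FILTER_FILE"
fi

# Keep a backup, then rewrite only the last line ('$' addresses it in sed).
cp "$FILTER_FILE" "$FILTER_FILE.bak"
sed '$s|.*|+^http://www.163.com/|' "$FILTER_FILE.bak" > "$FILTER_FILE"

# Show the resulting rule.
tail -n 1 "$FILTER_FILE"
```

The backup copy ($FILTER_FILE.bak) keeps the previous rules, so the change is easy to undo.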
3. Add the following properties to $NUTCH_HOME/runtime/local/conf/nutch-site.xml, inside the <configuration> element:
<property>
<name>http.agent.name</name>
<value>My Nutch Agent</value>
<description>HTTP 'User-Agent' request header. MUST NOT be empty -
please set this to a single word uniquely related to your organization.
NOTE: You should also check other related properties:
http.robots.agents
http.agent.description
http.agent.url
http.agent.email
http.agent.version
and set their values appropriately.
</description>
</property>
<property>
<name>http.agent.description</name>
<value></value>
<description>Further description of our bot- this text is used in
the User-Agent header. It appears in parenthesis after the agent name.
</description>
</property>
<property>
<name>http.agent.url</name>
<value></value>
<description>A URL to advertise in the User-Agent header. This will
appear in parenthesis after the agent name. Custom dictates that this
should be a URL of a page explaining the purpose and behavior of this crawler.
</description>
</property>
<property>
<name>http.agent.email</name>
<value></value>
<description>An email address to advertise in the HTTP 'From' request
header and User-Agent header. A good practice is to mangle this
address (e.g. 'info at example dot com') to avoid spamming.
</description>
</property>
4. Under $NUTCH_HOME/runtime/local, create the file
urls/seed.txt
with the following content:
http://www.163.com
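Step 4 can be done from the command line. This is a small sketch under one assumption: NUTCH_LOCAL is a hypothetical variable standing for $NUTCH_HOME/runtime/local, and it falls back to a scratch directory when unset.

```shell
#!/bin/sh
# Sketch: create urls/seed.txt containing the single seed URL.
# NUTCH_LOCAL is a hypothetical variable; set it to $NUTCH_HOME/runtime/local.
NUTCH_LOCAL="${NUTCH_LOCAL:-$(mktemp -d)}"

mkdir -p "$NUTCH_LOCAL/urls"
# One URL per line; this overwrites any existing seed list.
printf '%s\n' 'http://www.163.com' > "$NUTCH_LOCAL/urls/seed.txt"

cat "$NUTCH_LOCAL/urls/seed.txt"
```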
5. Replace the <schema></schema> element in
$SOLR_HOME/example/solr/conf/schema.xml
with the <schema></schema> element from $NUTCH_HOME/runtime/local/conf/schema-solr4.xml.
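Since <schema> is the root element of both files, one way to approximate step 5 is to copy the whole file over Solr's schema.xml after taking a backup. This is a sketch of that shortcut, not the exact element-level edit described above; install_schema is a hypothetical helper name.

```shell
#!/bin/sh
# install_schema SRC DST: back up DST if it exists, then overwrite it with SRC.
# Hypothetical helper; copies the whole file instead of editing the element.
install_schema() {
    src=$1
    dst=$2
    if [ ! -f "$src" ]; then
        echo "missing source schema: $src" >&2
        return 1
    fi
    if [ -f "$dst" ]; then
        cp "$dst" "$dst.orig"   # keep the stock Solr schema for rollback
    fi
    cp "$src" "$dst"
}

# Run only when both variables from step 1 are set.
if [ -n "${NUTCH_HOME:-}" ] && [ -n "${SOLR_HOME:-}" ]; then
    install_schema "$NUTCH_HOME/runtime/local/conf/schema-solr4.xml" \
                   "$SOLR_HOME/example/solr/conf/schema.xml"
fi
```

Restart Solr afterwards so that the new schema is loaded.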
7. Change to the $SOLR_HOME/example/ directory and start Solr:
java -jar start.jar
Open http://localhost:8983/solr in a browser to confirm that Solr is running.
8. Under $NUTCH_HOME/runtime/local, run:
bin/nutch crawl urls -solr http://localhost:8983/solr/ -dir crawl -depth 10 -threads 5 -topN 100
If the crawl has been run before, delete the crawl directory first: rm -rf crawl
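The clean-up and the crawl from step 8 can be wrapped in one function so that a rerun always starts from an empty crawl directory. This is a sketch; recrawl is a hypothetical helper name, and it assumes the current directory is $NUTCH_HOME/runtime/local.

```shell
#!/bin/sh
# recrawl: delete any previous crawl data, then rerun the step 8 command.
#   -depth   number of link-following rounds
#   -threads number of fetcher threads
#   -topN    maximum number of pages fetched per round
recrawl() {
    rm -rf crawl
    bin/nutch crawl urls -solr http://localhost:8983/solr/ \
        -dir crawl -depth 10 -threads 5 -topN 100
}

# Run only from a directory that actually contains bin/nutch.
if [ -x bin/nutch ]; then
    recrawl
else
    echo "bin/nutch not found; run this from \$NUTCH_HOME/runtime/local" >&2
fi
```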
9. Wait for the crawl to finish. On the http://localhost:8983/solr page, click Query,
enter content:新闻 in the q field (新闻 means "news"), and run the search.
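The same query can be sent from the command line with curl. This is a sketch under assumptions: solr_query is a hypothetical helper, and the /select handler path relative to the Solr base URL is assumed for this build (it may sit under a core name on other setups).

```shell
#!/bin/sh
# Base URL of the Solr instance started in step 7.
SOLR_URL="${SOLR_URL:-http://localhost:8983/solr}"

# solr_query QUERY: send QUERY to the select handler and print the response.
# -G turns the --data-urlencode pairs into a URL-encoded GET query string,
# which matters for non-ASCII terms such as 新闻.
solr_query() {
    curl -sS -G "$SOLR_URL/select" \
        --data-urlencode "q=$1" \
        --data-urlencode "wt=json" \
        --data-urlencode "indent=true"
}
```

With Solr running, `solr_query 'content:新闻'` should return the matching documents as JSON.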
Packages used:
apache-nutch-1.4-bin.tar.gz
apache-solr-4.0-2012-05-08_08-10-03.tgz
Test environment: CentOS 6.2