Installation Notes for Apache Nutch 1.7 + Solr 4.4.0 on CentOS 6.4

Original work by the author. Please credit the source when reposting: http://blog.csdn.net/panjunbiao/article/details/12171147

Installing Nutch

Reference: http://wiki.apache.org/nutch/NutchTutorial

Install the required packages:
yum update
yum list java* 
yum install java-1.7.0-openjdk-devel.x86_64 

Find the Java installation path:
Reference: http://serverfault.com/questions/50883/what-is-the-value-of-java-home-for-centos
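As a minimal sketch of the lookup, the helper below follows the symlink chain behind a java executable and strips the trailing /bin/java. The function name java_home_from is hypothetical, GNU readlink is assumed, and the result may be a jre subdirectory rather than the JDK root when the resolved binary lives under .../jre/bin.

```shell
# Print the installation root implied by a java executable path (hypothetical helper).
java_home_from() {
    # readlink -f follows every symlink, e.g. /usr/bin/java -> /etc/alternatives/java -> ...
    real=$(readlink -f "$1") || return 1
    # Strip the trailing /bin/java to get the installation root.
    dirname "$(dirname "$real")"
}
```

On a machine with Java on the PATH, it could be called as: java_home_from "$(command -v java)"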
Set JAVA_HOME:
Reference: http://www.cnblogs.com/zhoulf/archive/2013/02/04/2891608.html

vi + /etc/profile
JAVA_HOME=/usr/lib/jvm/java
JRE_HOME=/usr/lib/jvm/java/jre
PATH=$PATH:$JAVA_HOME/bin:$JRE_HOME/bin
CLASSPATH=.:$JAVA_HOME/lib/dt.jar:$JAVA_HOME/lib/tools.jar:$JRE_HOME/lib
export JAVA_HOME JRE_HOME PATH CLASSPATH
Make the profile take effect immediately:
source /etc/profile
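After sourcing the profile, it may be worth confirming that JAVA_HOME points at a real installation. A minimal sketch, assuming a POSIX shell; check_java_home is a hypothetical helper name, not part of Nutch or the JDK:

```shell
# Report whether JAVA_HOME is set and contains an executable bin/java.
check_java_home() {
    if [ -z "${JAVA_HOME:-}" ]; then
        echo "JAVA_HOME is not set"
        return 1
    fi
    if [ ! -x "$JAVA_HOME/bin/java" ]; then
        echo "no executable found at $JAVA_HOME/bin/java"
        return 1
    fi
    echo "JAVA_HOME looks valid: $JAVA_HOME"
}
```

Calling check_java_home in the same shell that sourced /etc/profile should print the resolved path, or the reason the check failed.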

Download the binary package:
curl -O http://apache.fayea.com/apache-mirror/nutch/1.7/apache-nutch-1.7-bin.tar.gz

Unpack it:
tar -xvzf apache-nutch-1.7-bin.tar.gz
 
Verify the executable:
cd apache-nutch-1.7
bin/nutch
The usage help should now be printed, which indicates that the installation succeeded.

Edit conf/nutch-site.xml to set the agent name sent in HTTP requests:
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
  <property>
    <name>http.agent.name</name>
    <value>Friendly Crawler</value>
  </property>
</configuration>

Create the seed directory:
mkdir -p urls
 
Run the first crawl:
bin/nutch crawl urls -dir crawl
solrUrl is not set, indexing will be skipped...
crawl started in: crawl
rootUrlDir = urls
threads = 10
depth = 5
solrUrl=null
Injector: starting at 2013-09-29 12:01:30
Injector: crawlDb: crawl/crawldb
Injector: urlDir: urls
Injector: Converting injected urls to crawl db entries.
Injector: total number of urls rejected by filters: 0
Injector: total number of urls injected after normalization and filtering: 0
Injector: Merging injected urls into crawl db.
Injector: finished at 2013-09-29 12:01:33, elapsed: 00:00:03
Generator: starting at 2013-09-29 12:01:33
Generator: Selecting best-scoring urls due for fetch.
Generator: filtering: true
Generator: normalizing: true
Generator: jobtracker is 'local', generating exactly one partition.
Generator: 0 records selected for fetching, exiting ...
Stopping at depth=0 - no more URLs to fetch.
No URLs to fetch - check your seed list and URL filters.
crawl finished: crawl
Because no seed URL has been set yet, the crawler exits without doing anything.
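An empty run like the one above can be caught before starting the crawler by counting the usable seed lines. The sketch below assumes that blank lines and lines starting with # do not count as seeds; count_seeds is a hypothetical helper name.

```shell
# Print the number of non-blank, non-comment lines in all files under a seed directory.
count_seeds() {
    dir=$1
    if [ ! -d "$dir" ]; then
        echo 0
        return 0
    fi
    # grep -c prints 0 and exits non-zero when nothing matches, hence "|| true".
    find "$dir" -type f -exec cat {} + 2>/dev/null \
        | grep -Evc '^[[:space:]]*(#|$)' || true
}
```

For example, count_seeds urls printing 0 would correspond to the "0 records selected for fetching" outcome in the log.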

Write the seed URL into the file urls/seed.txt:
http://www.36kr.com/
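The seed file can also be written without opening an editor. A minimal sketch, run from the apache-nutch-1.7 directory; note that it overwrites any existing urls/seed.txt:

```shell
# Create the seed directory (if needed) and write one URL per line.
seed_dir=urls
mkdir -p "$seed_dir"
printf '%s\n' 'http://www.36kr.com/' > "$seed_dir/seed.txt"

# Show the seed list that was written.
cat "$seed_dir/seed.txt"
```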
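Then edit conf/regex-urlfilter.txt: comment out the catch-all rule and add a rule that accepts only the seed domain:
vi conf/regex-urlfilter.txt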
# accept anything else
# +.

# added by panjunbiao
+36kr.com
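The rule +36kr.com is an unanchored regular expression, so, assuming the filter accepts a URL when the pattern is found anywhere in it, it would also accept URLs that merely mention the domain. The sketch below compares it with an anchored variant modeled on the MY.DOMAIN.NAME example in the Nutch tutorial. It uses grep -E as a stand-in for the Java regex engine, which is an approximation that should hold for these simple patterns; url_matches is a hypothetical helper.

```shell
# Return 0 if the URL contains a match for the given extended regex.
url_matches() {
    printf '%s\n' "$2" | grep -Eq -- "$1"
}

loose='36kr.com'
strict='^https?://([a-z0-9-]+\.)*36kr\.com/'

# Print an accept/reject verdict for each pattern and URL combination.
for url in 'http://www.36kr.com/' 'http://example.org/?from=36kr.com'; do
    for re in "$loose" "$strict"; do
        if url_matches "$re" "$url"; then verdict=accept; else verdict=reject; fi
        printf '%-7s %-40s %s\n' "$verdict" "$re" "$url"
    done
done
```

If the stricter behavior is wanted, the corresponding filter line would be +^https?://([a-z0-9-]+\.)*36kr\.com/ in place of +36kr.com.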

Run the crawler again and notice that some seed sites are skipped:
bin/nutch crawl urls -dir crawl
solrUrl is not set, indexing will be skipped...
crawl started in: crawl
rootUrlDir = urls
threads = 10
depth = 5
solrUrl=null
Injector: starting at 2013-09-29 12:10:24
Injector: crawlDb: crawl/crawldb
Injector: urlDir: urls
Injector: Converting injected urls to crawl db entries.
Injector: total number of urls rejected by filters: 0
Injector: total number of urls injected after normalization and filtering: 1
Injector: Merging injected urls into crawl db.
Injector: finished at 2013-09-29 12:10:27, elapsed: 00:00:03
Generator: starting at 2013-09-29 12:10:27
Generator: Selecting best-scoring urls due for fetch.
Generator: filtering: true
Generator: normalizing: true
Generator: jobtracker is 'local', generating exactly one partition.
Generator: Partitioning selected urls for politeness.
Generator: segment: crawl/segments/20130929121029
Generator: finished at 2013-09-29 12:10:30, elapsed: 00:00:03
Fetcher: Your 'http.agent.name' value should be listed first in 'http.robots.agents' property.
Fetcher: starting at 2013-09-29 12:10:30
Fetcher: segment: crawl/segments/20130929121029
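The Fetcher warning in the log above asks for the http.agent.name value to be listed first in the http.robots.agents property. As a hedged sketch, the helper below prints a property fragment that could be pasted inside the <configuration> element of conf/nutch-site.xml. The function name robots_agents_property is hypothetical, and keeping * at the end of the list follows the description in nutch-default.xml as I understand it.

```shell
# Print an http.robots.agents property that lists the given agent name first.
robots_agents_property() {
    agent=$1
    cat <<EOF
  <property>
    <name>http.robots.agents</name>
    <value>${agent},*</value>
  </property>
EOF
}

# The agent name must match the http.agent.name value configured earlier.
robots_agents_property 'Friendly Crawler'
```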