1. Install HBase
a) Download hbase-0.90.6.tar.gz
b) Extract hbase-0.90.6
c) Configure JAVA_HOME
vim conf/hbase-env.sh
export JAVA_HOME=/home/seal/seal/workspace/jdk1.7.0_04
d) Run bin/start-hbase.sh
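Before starting HBase, it can help to confirm that the JAVA_HOME value set in conf/hbase-env.sh points at a real JDK. The following is a minimal sketch for a POSIX shell; the function name check_java_home is hypothetical and not part of HBase.

```shell
# Hypothetical helper (not part of HBase): succeed only if the given
# directory contains an executable bin/java.
check_java_home() {
    java_home="$1"
    if [ -x "$java_home/bin/java" ]; then
        echo "OK: $java_home/bin/java found"
        return 0
    fi
    echo "ERROR: no executable bin/java under '$java_home'" >&2
    return 1
}
```

For example, from the HBase directory: check_java_home /home/seal/seal/workspace/jdk1.7.0_04 && bin/start-hbase.sh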
2. Install Nutch
a) Download apache-nutch-2.1
b) Edit conf/nutch-site.xml
<property>
<name>storage.data.store.class</name>
<value>org.apache.gora.hbase.store.HBaseStore</value>
<description>The Gora DataStore class for storing and retrieving data. </description>
</property>
c) Edit conf/gora.properties
gora.datastore.default=org.apache.gora.hbase.store.HBaseStore
#gora.datastore.default=org.apache.gora.mock.store.MockDataStore
#gora.sqlstore.jdbc.driver=org.hsqldb.jdbc.JDBCDriver
#gora.sqlstore.jdbc.url=jdbc:hsqldb:hsql://localhost/nutchtest
#gora.sqlstore.jdbc.user=sa
#gora.sqlstore.jdbc.password=
d) Edit ivy/ivy.xml to include the gora-hbase dependency
<dependency org="org.apache.gora" name="gora-hbase" rev="0.2.1" conf="*->default" />
e) Build with ant
ant -f build.xml
f) Copy the jar files from HBase into Nutch's lib directory.
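Step f) can be sketched as a small shell function. The function name copy_jars and both directory arguments are placeholders: these notes do not say which HBase directory or which Nutch lib directory was used, so substitute the paths from your own installation.

```shell
# Hypothetical helper: copy every *.jar found directly inside SRC into DEST.
copy_jars() {
    src="$1"
    dest="$2"
    mkdir -p "$dest" || return 1
    copied=0
    for jar in "$src"/*.jar; do
        # If nothing matches, the unexpanded pattern is not a file; skip it.
        [ -f "$jar" ] || continue
        cp "$jar" "$dest"/ || return 1
        copied=$((copied + 1))
    done
    echo "copied $copied jar(s) from $src to $dest"
}
```

Call it as copy_jars SRC DEST, where SRC is the HBase directory that holds the jars and DEST is the Nutch lib directory. Only top-level .jar files are copied; subdirectories are not searched.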
3. Test
In /home/seal/seal/workspace/apache-nutch-2.1/runtime/local/bin, create a urls.txt file with the following content:
http://knet.cn/
Then run the crawl:
./nutch crawl urls.txt -depth 3 -topN 100
Test log:
Check the data stored in HBase
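One way to inspect what the crawl stored is to pipe a few commands into the HBase shell. The sketch below only prints the commands. The helper name hbase_check_commands is hypothetical, and the table name webpage is an assumption based on the default Gora mapping in Nutch 2.x, so pass a different name if your setup uses one.

```shell
# Hypothetical helper: print HBase shell commands that list all tables,
# count the rows of one table, and show its first few rows.
# The default table name "webpage" is an assumption.
hbase_check_commands() {
    table="${1:-webpage}"
    printf "list\n"
    printf "count '%s'\n" "$table"
    printf "scan '%s', {LIMIT => 5}\n" "$table"
}
```

From the HBase directory, the output can be fed to the shell with: hbase_check_commands | bin/hbase shell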
Problem:
[seal@epp2 bin]$ ./nutch inject urls.txt
InjectorJob: starting
InjectorJob: urlDir: urls.txt
Skipping www.knet.cn:java.net.MalformedURLException: no protocol: www.knet.cn
Solution:
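The exception reports that the seed www.knet.cn has no protocol, so one plausible fix is to make sure every line in the seed file starts with a scheme such as http://. The following is a minimal sketch, assuming the seed file holds one URL per line. The function name add_scheme and the choice of http:// as the default scheme are assumptions and do not come from these notes.

```shell
# Hypothetical helper: read seed URLs on stdin, one per line, and prefix
# "http://" to any line that has no scheme. Empty lines are dropped.
add_scheme() {
    while IFS= read -r line || [ -n "$line" ]; do
        case "$line" in
            '')    ;;                                 # skip empty lines
            *://*) printf '%s\n' "$line" ;;           # already has a scheme
            *)     printf 'http://%s\n' "$line" ;;    # assume plain HTTP
        esac
    done
}
```

For example, add_scheme < urls.txt > urls.fixed.txt writes a corrected copy, which can then be passed to ./nutch inject in place of the original file.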