Running Nutch 2.3 in Eclipse


Reference: http://wiki.apache.org/nutch/RunNutchInEclipse


I. Environment setup

1. Download the Nutch 2.3 source code

wget http://mirror.bit.edu.cn/apache/nutch/2.3/apache-nutch-2.3-src.tar.gz

Or check out the latest version under development:

svn co https://svn.apache.org/repos/asf/nutch/branches/2.x


2. Choose the storage backend. This walkthrough uses HBase.
Add the following property to conf/nutch-site.xml:

<property>
  <name>storage.data.store.class</name>
  <value>org.apache.gora.hbase.store.HBaseStore</value>
  <description>Default class for storing data</description>
</property>


3. Enable the HBase dependency in ivy/ivy.xml. The entry already exists but is commented out, so you only need to uncomment it:

<dependency org="org.apache.gora" name="gora-hbase" rev="0.5" conf="*->default" />

Note: rev=0.5 corresponds to HBase 0.94, and rev=0.3 corresponds to HBase 0.90.4.


4. Add the following three properties to conf/nutch-site.xml

<property>
  <name>http.agent.name</name>
  <value>My Nutch Spider</value>
</property>
<property>
  <name>http.robots.agents</name>
  <value>none</value>
</property>
<property>
  <name>plugin.folders</name>
  <value>/Users/liaoliuqing/0_Search/1_Nutch/1_Official/apache-nutch-2.3/build/plugins</value>
</property>

The value of plugin.folders is $NUTCH_HOME/build/plugins, written out as an absolute path as in the example above.
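As a small sketch of how that value can be derived, the shell lines below print the path to paste into plugin.folders. They assume an environment variable NUTCH_HOME that points at the unpacked source tree. Nutch does not set this variable for you, so the snippet falls back to the current directory when it is unset.

```shell
# Assumption: NUTCH_HOME is the root of the unpacked Nutch source tree.
# If it is unset, fall back to the current working directory.
NUTCH_HOME="${NUTCH_HOME:-$(pwd)}"

# plugin.folders should name the build/plugins directory under that root.
PLUGIN_DIR="$NUTCH_HOME/build/plugins"
echo "$PLUGIN_DIR"
```

Run it from the source root (or export NUTCH_HOME first) and copy the printed path into the value element.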


5. Run ant eclipse from the Nutch source root
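A quick way to confirm that step 5 worked is to look for the Eclipse metadata files. The sketch below assumes that ant eclipse writes the usual .project and .classpath files into the source root. has_eclipse_files is a hypothetical helper name, not part of Nutch.

```shell
# Hypothetical helper: succeeds if the given directory contains the
# Eclipse project files that `ant eclipse` is expected to generate.
has_eclipse_files() {
  [ -f "$1/.project" ] && [ -f "$1/.classpath" ]
}

# Usage: run from the Nutch source root after `ant eclipse`.
if has_eclipse_files "."; then
  echo "Eclipse project files found"
else
  echo "Eclipse project files missing; re-run ant eclipse"
fi
```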


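II. Import the project

1. Import the project into Eclipse (File > Import > Existing Projects into Workspace)


2. In the Java Build Path settings, open the Order and Export tab, select apache-nutch-2.3/conf, and click the Top button to move it to the top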



III. Run the program

1. Run As > Run Configurations: select the project and the main class (the output in step 3 comes from InjectorJob)


2. Fill in the arguments

Program arguments:
/Users/liaoliuqing/Downloads/seed.txt

VM arguments:
-Dhadoop.log.dir=logs -Dhadoop.log.file=hadoop.log
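The seed file passed as the program argument is a plain text file with one URL per line. The sketch below creates one. The URL is inferred from the row key shown in the HBase scan in step 4, and the SEED_FILE variable and its default path are placeholders to adapt.

```shell
# Placeholder path; point this at whatever you enter as the program argument.
SEED_FILE="${SEED_FILE:-./seed.txt}"

# One URL per line. This URL is inferred from the row key in the scan output.
printf '%s\n' 'http://www.163.com/' > "$SEED_FILE"

# Show what was written.
cat "$SEED_FILE"
```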


3. Click Run. The output looks like this:

InjectorJob: starting at 2015-01-28 16:27:43
InjectorJob: Injecting urlDir: /Users/liaoliuqing/Downloads/seed.txt
InjectorJob: Using class org.apache.gora.hbase.store.HBaseStore as the Gora storage class.
InjectorJob: total number of urls rejected by filters: 0
InjectorJob: total number of urls injected after normalization and filtering: 1
Injector: finished at 2015-01-28 16:27:47, elapsed: 00:00:04


Note: HBase must already be running on the local machine before you run the program.
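One way to check this from a terminal is to look for the HBase master process in the output of jps, the JVM process lister shipped with the JDK. This is a sketch under the assumption that HBase runs as a local JVM whose main class is listed as HMaster. hmaster_listed is a hypothetical helper name.

```shell
# Hypothetical helper: reads `jps`-style output on stdin and succeeds
# if a process named HMaster is listed.
hmaster_listed() {
  grep -qw 'HMaster'
}

# Usage: jps prints one "<pid> <main class>" pair per line.
if command -v jps >/dev/null 2>&1 && jps | hmaster_listed; then
  echo "HBase master appears to be running"
else
  echo "HMaster not found; start HBase before running the job"
fi
```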


4. Inspect the data in HBase

hbase(main):003:0> scan 'webpage'
ROW                                         COLUMN+CELL                                                                                                                 
 com.163.www:http/                          column=f:fi, timestamp=1422433667377, value=\x00'\x8D\x00                                                                   
 com.163.www:http/                          column=f:ts, timestamp=1422433667377, value=\x00\x00\x01K/\xA7:\x14                                                         
 com.163.www:http/                          column=mk:_injmrk_, timestamp=1422433667377, value=y                                                                        
 com.163.www:http/                          column=mk:dist, timestamp=1422433667377, value=0                                                                            
 com.163.www:http/                          column=mtdt:_csh_, timestamp=1422433667377, value=?\x80\x00\x00                                                             
 com.163.www:http/                          column=s:s, timestamp=1422433667377, value=?\x80\x00\x00                                                                    
1 row(s) in 0.2970 seconds
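
The row key com.163.www:http/ is the seed URL with its host name reversed, which keeps pages from the same domain next to each other in the table. Nutch builds this key internally. The function below is only an illustrative approximation of the key shape seen in the scan (reversed host, then the scheme, an optional port, and the path), and reverse_url is a hypothetical name.

```shell
# Hypothetical helper: approximate the row-key shape seen in the scan.
# Handles URLs of the form scheme://host[:port][/path].
reverse_url() {
  url="$1"
  scheme="${url%%://*}"        # e.g. http
  rest="${url#*://}"           # e.g. www.163.com/
  hostport="${rest%%/*}"       # e.g. www.163.com or host:8080

  # Keep the path (with its leading slash) if there is one.
  case "$rest" in
    */*) path="/${rest#*/}" ;;
    *)   path="" ;;
  esac

  # Split off an explicit port, if present.
  host="${hostport%%:*}"
  case "$hostport" in
    *:*) port=":${hostport#*:}" ;;
    *)   port="" ;;
  esac

  # Reverse the dot-separated host labels: www.163.com -> com.163.www
  reversed=$(printf '%s\n' "$host" | tr '.' '\n' |
    awk '{ a[NR] = $0 } END { for (i = NR; i > 0; i--) printf "%s%s", a[i], (i > 1 ? "." : "") }')

  printf '%s:%s%s%s\n' "$reversed" "$scheme" "$port" "$path"
}

reverse_url 'http://www.163.com/'   # prints com.163.www:http/
```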






Reposted from: https://www.cnblogs.com/jpfss/p/7885887.html
