Importing Nutch into Eclipse

 

Today I wanted to read the Nutch source code, so I imported Nutch into Eclipse.

 

1. First, download the Nutch archive: http://labs.renren.com/apache-mirror/nutch/

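Which version to download depends on the JDK version installed on your machine. Check the requirement below (I did not notice it at first, and it cost me some time).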

What Java version is required to run Nutch?
Nutch 0.7 will run with Java 1.4 and up. Nutch 1.0 with Java 6.

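I originally downloaded Nutch 1.0, but it kept reporting errors after I imported it into Eclipse. The cause turned out to be my installed JDK, which is java version "1.5.0_06". So I switched to an older Nutch release, and Nutch 0.9 worked fine.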
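The version check above can be sketched in a few lines of Java. This is a hypothetical helper, not part of Nutch: the class and method names are made up for illustration, and the mapping only encodes what this post states (the FAQ thresholds for 0.7 and 1.0, plus my own experience that 0.9 runs on Java 1.5).

```java
// Hypothetical helper, not part of Nutch: suggests a Nutch release for a given JDK.
final class NutchJavaCheck {

    // Parses "1.5" or "1.5.0_06" -> 5, and the newer numbering scheme "11" -> 11.
    static int majorVersion(String version) {
        String[] parts = version.split("\\.");
        int first = Integer.parseInt(parts[0]);
        return (first == 1 && parts.length > 1) ? Integer.parseInt(parts[1]) : first;
    }

    // Assumption: thresholds come from the FAQ quoted above (0.7 -> Java 1.4+,
    // 1.0 -> Java 6) and from this post (0.9 worked on Java 1.5).
    static String newestUsableNutch(int javaMajor) {
        if (javaMajor >= 6) return "1.0";
        if (javaMajor == 5) return "0.9";
        if (javaMajor == 4) return "0.7";
        return "none";
    }

    public static void main(String[] args) {
        String running = System.getProperty("java.specification.version");
        System.out.println("JVM spec version: " + running);
        System.out.println("Suggested Nutch release: " + newestUsableNutch(majorVersion(running)));
    }
}
```

Running `java -version` on the command line gives the same information; the sketch just makes the decision explicit.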

 


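2. Extract nutch-0.9.tar.gz.

3. Create a new Java project in Eclipse. I named mine nutch.

4. Copy the org folder from nutch-0.9/src/java and paste it into the src directory of the nutch project.

5. Copy the plugins folder from the nutch-0.9 directory and paste it into the src directory of the nutch project.

6. Copy the conf directory from nutch-0.9 into the nutch project directory; it holds the configuration files. Then, in Eclipse, right-click conf and choose Build Path --> Configure Build Path --> Add Folder, select conf, and click OK.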

 

 

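7. In the project's conf directory, open nutch-default.xml and change two properties (in each case the part to change is the <value> element).

The first property to change is http.agent.name: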

<!-- HTTP properties -->

<property>
  <name>http.agent.name</name>
  <value>darwin's spider</value>  <!-- any name will do -->
  <description>HTTP 'User-Agent' request header. MUST NOT be empty -
  please set this to a single word uniquely related to your organization.

  NOTE: You should also check other related properties:

    http.robots.agents
    http.agent.description
    http.agent.url
    http.agent.email
    http.agent.version

  and set their values appropriately.

  </description>
</property>

 

The second property to change is plugin.folders, in the plugin properties section:

<!-- plugin properties -->
<property>
<name>plugin.folders</name>
<value>./src/plugins</value>   <!-- matches the path set up in step 5 -->
<description>Directories where nutch plugins are located. Each
element may be a relative or absolute path. If absolute, it is used
as is. If relative, it is searched for on the classpath.</description>
</property>

8. In the project's conf directory, open crawl-urlfilter.txt and find the line +^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/ . Replace MY.DOMAIN.NAME with the site you want to crawl,
for example: +^http://([a-z0-9]*\.)*163.com/
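To see what the rule from step 8 lets through, here is a minimal sketch using plain java.util.regex. It is not Nutch's own filter code: the class name is made up, and it assumes the pattern (with escaped dots, `\.`, and without the leading `+`) is applied with find() semantics, which the `^` anchor suggests.

```java
import java.util.regex.Pattern;

// Hypothetical sketch, not Nutch's filter implementation.
final class UrlFilterSketch {

    // The accept rule from step 8, minus the leading '+'.
    static final Pattern ACCEPT = Pattern.compile("^http://([a-z0-9]*\\.)*163.com/");

    // Assumption: a URL is accepted if the pattern is found starting at the beginning.
    static boolean accepts(String url) {
        return ACCEPT.matcher(url).find();
    }

    public static void main(String[] args) {
        String[] samples = {
            "http://www.163.com/",
            "http://news.163.com/review/",
            "http://www.163.com",       // no trailing slash
            "http://www.example.org/"
        };
        for (String url : samples) {
            System.out.println(url + " -> " + accepts(url));
        }
    }
}
```

Note that the pattern ends in a slash, so a URL written without one is not matched by this sketch. The dot in 163.com is also unescaped, so it matches any character; writing 163\.com would be stricter.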

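9. In the nutch project directory, create a folder named urls. Inside it, create a file also named urls, with no extension, and use a text editor such as Notepad to write the URL of the site to crawl,
for example: http://www.163.com (this corresponds to the setting in step 8)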
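If you would rather create the seed file from code than by hand, the following sketch does the same thing as step 9. The class and method names are hypothetical, and it assumes a JDK with java.nio.file (Java 7 or later), so it is meant to be run separately from the Java 1.5 project above.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Collections;
import java.util.List;

// Hypothetical helper, not part of Nutch: creates urls/urls under a project directory.
final class SeedFileSketch {

    // Creates <projectDir>/urls/ if needed and writes one seed URL per line
    // into the extensionless file <projectDir>/urls/urls.
    static Path writeSeeds(Path projectDir, List<String> seedUrls) throws IOException {
        Path urlsDir = Files.createDirectories(projectDir.resolve("urls"));
        Path seedFile = urlsDir.resolve("urls");
        return Files.write(seedFile, seedUrls, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        // Assumption: the current working directory is the nutch project root.
        Path written = writeSeeds(Paths.get("."), Collections.singletonList("http://www.163.com"));
        System.out.println("Wrote " + written.toAbsolutePath());
    }
}
```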


 

10. Add the JAR files. Every JAR you need is in the nutch-0.9/lib directory; add all of them to the project's build path.

 

 

11. Finally, run Crawl.java in the org.apache.nutch.crawl package. Before running it, remember to set the arguments it needs: choose Run As..., open Run Configurations, and go to the Arguments tab.

 Program arguments:  urls -dir crawl -depth 3 -topN 50
 VM arguments:       -Dhadoop.log.dir=logs -Dhadoop.log.file=hadoop.log
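To make the program arguments easier to read, here is a small sketch that splits them into the same fields that the crawl prints at startup (rootUrlDir, depth, topN, and the output directory). It is my own illustration, not the argument handling in Nutch's Crawl class, and all names in it are hypothetical.

```java
// Hypothetical sketch, not Nutch's Crawl class: shows how the program
// arguments from step 11 break down into named settings.
final class CrawlArgsSketch {
    String rootUrlDir;   // folder holding the seed file (step 9)
    String dir;          // output directory for the crawl data
    int depth = -1;      // -1 means "not given"; not a Nutch default
    long topN = -1;      // -1 means "not given"; not a Nutch default

    // Assumes every option is followed by its value; a missing value
    // would raise ArrayIndexOutOfBoundsException in this sketch.
    static CrawlArgsSketch parse(String[] args) {
        CrawlArgsSketch parsed = new CrawlArgsSketch();
        for (int i = 0; i < args.length; i++) {
            if ("-dir".equals(args[i])) {
                parsed.dir = args[++i];
            } else if ("-depth".equals(args[i])) {
                parsed.depth = Integer.parseInt(args[++i]);
            } else if ("-topN".equals(args[i])) {
                parsed.topN = Long.parseLong(args[++i]);
            } else {
                parsed.rootUrlDir = args[i];
            }
        }
        return parsed;
    }

    public static void main(String[] args) {
        CrawlArgsSketch a = parse("urls -dir crawl -depth 3 -topN 50".split(" "));
        System.out.println("dir = " + a.dir);
        System.out.println("rootUrlDir = " + a.rootUrlDir);
        System.out.println("depth = " + a.depth);
        System.out.println("topN = " + a.topN);
    }
}
```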

-----------------------------------------------------------------------------------------------------------------------

The output looks like this:

crawl started in: crawl
rootUrlDir = urls
threads = 10
depth = 3
topN = 50
Injector: starting
Injector: crawlDb: crawl/crawldb
Injector: urlDir: urls
Injector: Converting injected urls to crawl db entries.
Injector: Merging injected urls into crawl db.
Injector: done
Generator: Selecting best-scoring urls due for fetch.
Generator: starting
Generator: segment: crawl/segments/20110503050411
Generator: filtering: false
Generator: topN: 50
Generator: jobtracker is 'local', generating exactly one partition.
Generator: Partitioning selected urls by host, for politeness.
Generator: done.
Fetcher: starting
Fetcher: segment: crawl/segments/20110503050411
Fetcher: threads: 10
fetching http://www.163.com/
Fetcher: done
CrawlDb update: starting
CrawlDb update: db: crawl/crawldb
CrawlDb update: segments: [crawl/segments/20110503050411]
CrawlDb update: additions allowed: true
CrawlDb update: URL normalizing: true
CrawlDb update: URL filtering: true
CrawlDb update: Merging segment data into db.
CrawlDb update: done
Generator: Selecting best-scoring urls due for fetch.
Generator: starting
Generator: segment: crawl/segments/20110503050418
Generator: filtering: false
Generator: topN: 50
Generator: jobtracker is 'local', generating exactly one partition.
Generator: Partitioning selected urls by host, for politeness.
Generator: done.
Fetcher: starting
Fetcher: segment: crawl/segments/20110503050418
Fetcher: threads: 10
fetching http://digi.163.com/
fetching http://money.163.com/fund/
fetching http://caipiao.163.com/
fetching http://mobile.163.com/
fetching http://sports.163.com/
fetching http://gongyi.163.com/
fetching http://blog.163.com/blogger.html
fetching http://sports.163.com/nba/
fetching http://money.163.com/stock/
fetching http://pay.163.com/
fetching http://lady.163.com/beauty/
fetching http://ent.163.com/movie/
fetching http://discovery.163.com/
fetching http://book.163.com/
fetching http://biz.163.com/
fetching http://ent.163.com/
fetching http://fashion.163.com/
fetching http://fushi.163.com/
fetch of http://gongyi.163.com/ failed with: Http code=503, url=http://gongyi.163.com/
fetching http://photo.163.com/
fetching http://edu.163.com/
fetching http://news.163.com/review/

........................................

 

And that's it. Now I can start digging into the Nutch source code.

 

 

For reference, here is the official guide to setting up Nutch in Eclipse: http://wiki.apache.org/nutch/RunNutchInEclipse1.0

 
