Importing Nutch into Eclipse


lidoublewen


Article link: https://blog.csdn.net/lidoublewen/article/details/6387742


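Today I wanted to read through the Nutch source code, so I imported Nutch into Eclipse.

1. First, download the Nutch archive: http://labs.renren.com/apache-mirror/nutch/

Which version to download depends on the JDK version installed on your machine. Check the requirement below (I did not notice it at first, and it cost me some time).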

What Java version is required to run Nutch?
Nutch 0.7 will run with Java 1.4 and up. Nutch 1.0 with Java 6.

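I originally downloaded Nutch 1.0, but it kept reporting errors after I imported it into Eclipse. The cause was that my installed JDK reported java version "1.5.0_06", which is older than the Java 6 that Nutch 1.0 requires. I switched to an older Nutch release, and Nutch 0.9 worked.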
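The version check above can be sketched in a few lines. This is only an illustration of the rule quoted from the Nutch FAQ (Nutch 1.0 needs Java 6); the helper names `java_major` and `nutch_1_0_supported` are hypothetical and not part of Nutch or the JDK.

```python
import re


def java_major(version):
    """Return the major Java version from a string such as '1.5.0_06'."""
    parts = re.split(r"[._]", version)
    # Legacy scheme: "1.5.0_06" means Java 5, so the major number is the second field.
    if parts[0] == "1" and len(parts) > 1:
        return int(parts[1])
    return int(parts[0])


def nutch_1_0_supported(version):
    """Nutch 1.0 requires Java 6, per the FAQ entry quoted above."""
    return java_major(version) >= 6


if __name__ == "__main__":
    # The JDK version reported on the author's machine.
    print(nutch_1_0_supported("1.5.0_06"))
```

The version string to pass in is the one printed by `java -version` on your own machine.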

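2. Extract nutch-0.9.tar.gz.

3. Create a new Java project in Eclipse. I named my project nutch.

4. Copy the org folder from nutch-0.9/src/java and paste it into the src directory of the nutch project.

5. Copy the plugins folder from the nutch-0.9 directory and paste it into the src directory of the nutch project.

6. Copy the conf folder from the nutch-0.9 directory into the nutch project directory; it contains the configuration files. Then, in Eclipse, right-click conf and choose Build Path --> Configure Build Path --> Add Folder, select conf, and click OK.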
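The copy operations in steps 4 to 6 can also be scripted instead of done by hand. The following is a minimal sketch, assuming Python 3.8 or later and that the archive was extracted to a directory passed in as `nutch_home`; the function name `layout_project` is hypothetical. It does not replace the Eclipse build path configuration in step 6, which still has to be done in the IDE.

```python
import shutil
from pathlib import Path


def layout_project(nutch_home, project):
    """Copy the Nutch sources, plugins, and conf into the Eclipse project."""
    nutch_home, project = Path(nutch_home), Path(project)
    copies = [
        # Step 4: nutch-0.9/src/java/org -> <project>/src/org
        (nutch_home / "src" / "java" / "org", project / "src" / "org"),
        # Step 5: nutch-0.9/plugins -> <project>/src/plugins
        (nutch_home / "plugins", project / "src" / "plugins"),
        # Step 6: nutch-0.9/conf -> <project>/conf
        (nutch_home / "conf", project / "conf"),
    ]
    for source, target in copies:
        # dirs_exist_ok lets the script be re-run on an existing project.
        shutil.copytree(source, target, dirs_exist_ok=True)
```

For example, `layout_project("nutch-0.9", "workspace/nutch")` would populate a project located at the assumed path workspace/nutch.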

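7. In the project's conf directory, open nutch-default.xml and modify two properties (the changed values were highlighted in red in the original post).

The first property to change is http.agent.name: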

<property>
<name>http.agent.name</name>
<value>darwin's spider</value> <!-- any name will do -->
<description>HTTP 'User-Agent' request header. MUST NOT be empty -
please set this to a single word uniquely related to your organization.

NOTE: You should also check other related properties:

    http.robots.agents
    http.agent.description
    http.agent.url
    http.agent.email
    http.agent.version

and set their values appropriately.

</description>
</property>

The second property to change is plugin.folders, in the plugin properties section:

<!-- plugin properties -->

<property>
  <name>plugin.folders</name>
  <value>./src/plugins</value> <!-- matches the path set up in step 5 -->
  <description>Directories where nutch plugins are located.  Each
  element may be a relative or absolute path.  If absolute, it is used
  as is.  If relative, it is searched for on the classpath.</description>
</property>
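To double-check the two edits, the property values can be read back from the file. The sketch below uses Python's standard xml.etree.ElementTree on a small inline sample shaped like nutch-default.xml (a configuration root element containing property elements); the sample text and the helper name `read_properties` are illustrative assumptions, not taken from the real file.

```python
import xml.etree.ElementTree as ET


def read_properties(xml_text):
    """Map each <property> name to its value in a Hadoop-style config document."""
    root = ET.fromstring(xml_text)
    props = {}
    for prop in root.iter("property"):
        name = prop.findtext("name")
        if name is not None:
            # A missing or empty <value> is stored as an empty string.
            props[name.strip()] = (prop.findtext("value") or "").strip()
    return props


# Trimmed-down stand-in for conf/nutch-default.xml (assumed layout).
SAMPLE = """<configuration>
<property>
  <name>http.agent.name</name>
  <value>darwin's spider</value>
</property>
<property>
  <name>plugin.folders</name>
  <value>./src/plugins</value>
</property>
</configuration>"""

props = read_properties(SAMPLE)
print(props["http.agent.name"])
print(props["plugin.folders"])
```

For the real file, `ET.parse("conf/nutch-default.xml").getroot()` could be used in place of `ET.fromstring`. An empty http.agent.name is worth flagging, since the property description says it must not be empty.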




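8. In the project's conf directory, open crawl-urlfilter.txt and find this line:

+^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/

Replace MY.DOMAIN.NAME (shown in red in the original post) with the domain of the site you want to crawl. For example:

+^http://([a-z0-9]*\.)*163.com/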
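It can help to try the filter expression on a few URLs before crawling. The sketch below tests the 163.com rule with Python's re module. Nutch itself evaluates these rules with Java regular expressions, so this is only an approximation, assuming the rule is applied as a partial match from the start of the URL and that the leading + simply marks the rule as an include rule. The function name `accepted` is hypothetical.

```python
import re

# The rule from crawl-urlfilter.txt without its leading "+" marker.
RULE = re.compile(r"^http://([a-z0-9]*\.)*163.com/")


def accepted(url):
    """Return True if the URL would match the include rule."""
    return RULE.search(url) is not None


if __name__ == "__main__":
    for url in ("http://www.163.com/",
                "http://money.163.com/fund/",
                "http://www.sina.com.cn/"):
        print(url, accepted(url))
```

Note that the dot in 163.com is not escaped in the rule, so it matches any character; writing 163\.com would be stricter.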

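9. In the nutch project directory, create a folder named urls, and inside it create a file named urls with no extension. Use a text editor to enter the URLs of the sites to crawl.
For example: http://www.163.com (this must match the filter set in step 8)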
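The seed file from step 9 can also be created with a short script. This is a minimal sketch; the function name `write_seed_file` is hypothetical, and writing one URL per line is an assumption about the seed list layout.

```python
from pathlib import Path


def write_seed_file(project, seeds):
    """Create <project>/urls/urls with one seed URL per line."""
    seed_dir = Path(project) / "urls"
    seed_dir.mkdir(parents=True, exist_ok=True)
    # The file is also named "urls" and has no extension, as in step 9.
    seed_file = seed_dir / "urls"
    seed_file.write_text("\n".join(seeds) + "\n", encoding="utf-8")
    return seed_file
```

For example, `write_seed_file("nutch", ["http://www.163.com"])` writes the single seed used in this post, assuming the project lives in a directory named nutch.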

10. Add the jar files. All the required jars are in the nutch-0.9/lib directory; add every one of them to the project's build path.

11. Finally, run Crawl.java in the org.apache.nutch.crawl package. Before running it, set the required arguments: click Run As..., open Run Configurations, and select the Arguments tab.

Program arguments: urls -dir crawl -depth 3 -topN 50
VM arguments: -Dhadoop.log.dir=logs -Dhadoop.log.file=hadoop.log

-----------------------------------------------------------------------------------------------------------------------

The run produces output like the following:

crawl started in: crawl
rootUrlDir = urls
threads = 10
depth = 3
topN = 50
Injector: starting
Injector: crawlDb: crawl/crawldb
Injector: urlDir: urls
Injector: Converting injected urls to crawl db entries.
Injector: Merging injected urls into crawl db.
Injector: done
Generator: Selecting best-scoring urls due for fetch.
Generator: starting
Generator: segment: crawl/segments/20110503050411
Generator: filtering: false
Generator: topN: 50
Generator: jobtracker is 'local', generating exactly one partition.
Generator: Partitioning selected urls by host, for politeness.
Generator: done.
Fetcher: starting
Fetcher: segment: crawl/segments/20110503050411
Fetcher: threads: 10
fetching http://www.163.com/
Fetcher: done
CrawlDb update: starting
CrawlDb update: db: crawl/crawldb
CrawlDb update: segments: [crawl/segments/20110503050411]
CrawlDb update: additions allowed: true
CrawlDb update: URL normalizing: true
CrawlDb update: URL filtering: true
CrawlDb update: Merging segment data into db.
CrawlDb update: done
Generator: Selecting best-scoring urls due for fetch.
Generator: starting
Generator: segment: crawl/segments/20110503050418
Generator: filtering: false
Generator: topN: 50
Generator: jobtracker is 'local', generating exactly one partition.
Generator: Partitioning selected urls by host, for politeness.
Generator: done.
Fetcher: starting
Fetcher: segment: crawl/segments/20110503050418
Fetcher: threads: 10
fetching http://digi.163.com/
fetching http://money.163.com/fund/
fetching http://caipiao.163.com/
fetching http://mobile.163.com/
fetching http://sports.163.com/
fetching http://gongyi.163.com/
fetching http://blog.163.com/blogger.html
fetching http://sports.163.com/nba/
fetching http://money.163.com/stock/
fetching http://pay.163.com/
fetching http://lady.163.com/beauty/
fetching http://ent.163.com/movie/
fetching http://discovery.163.com/
fetching http://book.163.com/
fetching http://biz.163.com/
fetching http://ent.163.com/
fetching http://fashion.163.com/
fetching http://fushi.163.com/
fetch of http://gongyi.163.com/ failed with: Http code=503, url=http://gongyi.163.com/
fetching http://photo.163.com/
fetching http://edu.163.com/
fetching http://news.163.com/review/
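A log like the one above is easier to review when the fetch lines are summarized. The following sketch counts fetch attempts and collects failures, assuming the two line shapes seen in this output ("fetching URL" and "fetch of URL failed with: REASON"); the function name `summarize` is hypothetical, and the sample lines are copied from the log above.

```python
import re

FETCHING = re.compile(r"^fetching (\S+)$")
FAILED = re.compile(r"^fetch of (\S+) failed with: (.+)$")


def summarize(log_text):
    """Return (list of fetched URLs, dict of failed URL -> reason)."""
    fetched, failed = [], {}
    for line in log_text.splitlines():
        line = line.strip()
        match = FETCHING.match(line)
        if match:
            fetched.append(match.group(1))
            continue
        match = FAILED.match(line)
        if match:
            failed[match.group(1)] = match.group(2)
    return fetched, failed


SAMPLE_LOG = """fetching http://gongyi.163.com/
fetching http://fushi.163.com/
fetch of http://gongyi.163.com/ failed with: Http code=503, url=http://gongyi.163.com/
fetching http://photo.163.com/
"""

fetched, failed = summarize(SAMPLE_LOG)
print(len(fetched), "fetch attempts,", len(failed), "failed")
```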