While setting up this whole environment I ran into many problems. I searched around online but never found a complete walkthrough, so I am recording the process here for other beginners. This post focuses on setting up nutch 1.7 + solr 4.8.1 + Eclipse on CentOS 6.5; there is already plenty of material on installing CentOS, Tomcat, and the JDK, so those steps are not repeated here.
1. Install the JDK on CentOS 6.5 (jdk-7u71-linux-i586.rpm); the detailed steps are omitted here. (Note: Solr 4.8 requires JDK 1.7.)
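A quick sanity check after installing the RPM (a minimal sketch, assuming the package installs with its default settings):
[hadoop@localhost ~]$ rpm -ivh jdk-7u71-linux-i586.rpm
[hadoop@localhost ~]$ java -version
The second command should report a 1.7.0_71 version.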
2. Download the Linux distribution of nutch 1.7
- Download page: http://archive.apache.org/dist/nutch/1.7/
- Files: apache-nutch-1.7-bin.tar.gz and apache-nutch-1.7-src.tar.gz
- Save them to /usr/download/ (create the directory if it does not exist)
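If the machine has network access, the archives can also be fetched directly with wget (a sketch using the download address above):
[hadoop@localhost ~]$ mkdir -p /usr/download
[hadoop@localhost ~]$ cd /usr/download
[hadoop@localhost download]$ wget http://archive.apache.org/dist/nutch/1.7/apache-nutch-1.7-bin.tar.gz
[hadoop@localhost download]$ wget http://archive.apache.org/dist/nutch/1.7/apache-nutch-1.7-src.tar.gz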
3. Extract apache-nutch-1.7-bin.tar.gz
- Command: [hadoop@localhost ~]$ cd /usr/download/
- Command: [hadoop@localhost download]$ tar -zxvf apache-nutch-1.7-bin.tar.gz
- Command: [hadoop@localhost download]$ mv apache-nutch-1.7 /usr/
- Command: [hadoop@localhost download]$ mv /usr/apache-nutch-1.7 /usr/nutch
4. Set the nutch environment variables
[hadoop@localhost usr]$ vi /etc/profile
Add the following lines:
# set nutch environment
export NUTCH_HOME=/usr/nutch
export PATH=$PATH:$NUTCH_HOME/bin
5. Apply the environment variables
[hadoop@localhost usr]$ source /etc/profile
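A quick check that the variables took effect (which should resolve to the script under /usr/nutch/bin):
[hadoop@localhost usr]$ echo $NUTCH_HOME
[hadoop@localhost usr]$ which nutch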
6. Check that nutch is installed correctly
[hadoop@localhost usr]$ nutch
It should print the following usage text:
Usage: nutch COMMAND
where COMMAND is one of:
  crawl             one-step crawler for intranets (DEPRECATED - USE CRAWL SCRIPT INSTEAD)
  readdb            read / dump crawl db
  mergedb           merge crawldb-s, with optional filtering
  readlinkdb        read / dump link db
  inject            inject new urls into the database
  generate          generate new segments to fetch from crawl db
  freegen           generate new segments to fetch from text files
  fetch             fetch a segment's pages
  parse             parse a segment's pages
  readseg           read / dump segment data
  mergesegs         merge several segments, with optional filtering and slicing
  updatedb          update crawl db from segments after fetching
  invertlinks       create a linkdb from parsed segments
  mergelinkdb       merge linkdb-s, with optional filtering
  index             run the plugin-based indexer on parsed segments and linkdb
  solrindex         run the solr indexer on parsed segments and linkdb
  solrdedup         remove duplicates from solr
  solrclean         remove HTTP 301 and 404 documents from solr
  clean             remove HTTP 301 and 404 documents from indexing backends configured via plugins
  parsechecker      check the parser for a given url
  indexchecker      check the indexing filters for a given url
  domainstats       calculate domain statistics from crawldb
  webgraph          generate a web graph from existing segments
  linkrank          run a link analysis program on the generated web graph
  scoreupdater      updates the crawldb with linkrank scores
  nodedumper        dumps the web graph's node scores
  plugin            load a plugin and run one of its classes main()
  junit             runs the given JUnit test
 or
  CLASSNAME         run the class named CLASSNAME
Most commands print help when invoked w/o parameters.
7. Parameter configuration
7.1 Set the HTTP agent name (http.agent.name) in nutch-site.xml. The value can be arbitrary, but an arbitrary value only works for ordinary sites such as education sites; to crawl www.1688.com you need to impersonate a browser, in which case set the value to a real user-agent string:
Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.122 Safari/537.36
[hadoop@localhost usr]# vi /usr/nutch/conf/nutch-site.xml

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
  <property>
    <name>http.agent.name</name>
    <value>Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.122 Safari/537.36</value>
  </property>
</configuration>
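To verify the agent name is actually being used, the parsechecker command from the usage list in step 6 fetches and parses a single page with the configured settings (a quick smoke test; the URL here is just an example):
[hadoop@localhost usr]$ nutch parsechecker http://www.taobao.com/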
7.2 Set the plugin path in nutch-default.xml (note there must be no leading whitespace inside the value element):
<property>
  <name>plugin.folders</name>
  <value>/usr/nutch/plugins</value>
  <description>Directories where nutch plugins are located. Each
  element may be a relative or absolute path. If absolute, it is used
  as is. If relative, it is searched for on the classpath.</description>
</property>
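A quick check that the configured path exists (the plugins directory ships inside the extracted distribution, so after step 3 it should sit at /usr/nutch/plugins):
[hadoop@localhost usr]$ ls /usr/nutch/plugins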
8. Test crawl
Suppose we want to crawl http://www.taobao.com/.
8.1 Create the urls directory and seed.txt (the URL entry point)
Create a urls directory to hold the list of seed URLs to crawl:
[hadoop@localhost usr]# mkdir urls
[hadoop@localhost usr]# vim urls/seed.txt
Write http://www.taobao.com/ into seed.txt. If there are several seed URLs, put one per line.
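For example, a seed.txt with several entries would look like this (one URL per line; the second URL is a hypothetical subdomain used only to illustrate the format):
http://www.taobao.com/
http://shop.taobao.com/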
8.2 Edit the URL filter rules
Change the crawl URL regex so that only content on taobao.com is fetched:
[hadoop@localhost usr]# vim conf/regex-urlfilter.txt
Change the final rule
# accept anything else
+.
to
+^http://([a-z0-9]*\.)*taobao.com/
If you do not want to restrict the crawl, leave this file unchanged.
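For reference, here is what the pattern accepts and rejects (the sample URLs are illustrations only, written as comments in regex-urlfilter.txt syntax):
# +^http://([a-z0-9]*\.)*taobao.com/ accepts:
#   http://www.taobao.com/    (the "www." label matches ([a-z0-9]*\.)*)
#   http://taobao.com/        (zero subdomain labels also match)
# and skips:
#   http://www.1688.com/      (host is not taobao.com)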
8.3 Run the crawl command
[hadoop@localhost usr]# nutch crawl urls -dir taobao -depth 10 -threads 10 -topN 10
Parameter notes:
- -dir dirname: directory in which the crawled data is stored.
- -depth depth: how many link levels deep the crawl goes.
- -delay delay: delay between requests to different hosts, in seconds.
- -threads threads: number of fetcher threads to start.
- -topN number: limits how many top-ranked links are crawled in each iteration; the default is Integer.MAX_VALUE.
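Note that the usage output in step 6 marks nutch crawl as deprecated in favor of the crawl script shipped in the same bin directory. A roughly equivalent invocation would look like the sketch below (in nutch 1.7 the script also expects a Solr URL for indexing; the localhost URL is a placeholder for whatever your solr 4.8.1 instance ends up using):
[hadoop@localhost usr]# /usr/nutch/bin/crawl urls taobao http://localhost:8080/solr 10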