Setting up nutch1.7 + solr4.8.1 + Eclipse on CentOS 6.5 (Part 1): Installing nutch1.7

While putting this environment together I ran into a lot of problems. I found bits and pieces online but no complete walkthrough, so I am writing the process down here, both as a record and as a reference for newcomers. The focus is the nutch1.7 + solr4.8.1 + Eclipse setup on CentOS 6.5; installing CentOS itself, Tomcat and the JDK is well documented elsewhere and is not repeated here.

1. Install the JDK on CentOS 6.5 (jdk-7u71-linux-i586.rpm); the detailed steps are not covered here.
Note that Solr 4.8 requires JDK 1.7.
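A minimal sketch of installing and verifying the JDK (assuming the RPM was downloaded to /usr/download/; the rpm command has to run as root):
[hadoop@localhost download]$ su -c 'rpm -ivh jdk-7u71-linux-i586.rpm'
[hadoop@localhost download]$ java -version
java -version should report something like "1.7.0_71".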

2. Download the Linux distribution of nutch1.7
  •  Download location: http://archive.apache.org/dist/nutch/1.7/
  •  Files: apache-nutch-1.7-bin.tar.gz and apache-nutch-1.7-src.tar.gz
  •  Save path: /usr/download/ (create the directory if it does not exist)
3. Extract apache-nutch-1.7-bin.tar.gz and move it to /usr/nutch
  •  Command: [hadoop@localhost ~]$ cd /usr/download/
  •  Command: [hadoop@localhost download]$ tar -zxvf apache-nutch-1.7-bin.tar.gz
  •  Command: [hadoop@localhost download]$ mv apache-nutch-1.7 /usr/
  •  Command: [hadoop@localhost download]$ mv /usr/apache-nutch-1.7 /usr/nutch
4. Set the nutch environment variables
[hadoop@localhost usr]$ vi /etc/profile    (editing /etc/profile requires root privileges)
Add the following lines:
# set nutch environment
export NUTCH_HOME=/usr/nutch
export PATH=$PATH:$NUTCH_HOME/bin

5. Make the environment variables take effect
[hadoop@localhost usr]$ source /etc/profile

6. Check that nutch is installed correctly
[hadoop@localhost usr]$ nutch
The following usage information should be printed:
Usage: nutch COMMAND
where COMMAND is one of:
  crawl             one-step crawler for intranets (DEPRECATED - USE CRAWL SCRIPT INSTEAD)
  readdb            read / dump crawl db
  mergedb           merge crawldb-s, with optional filtering
  readlinkdb        read / dump link db
  inject            inject new urls into the database
  generate          generate new segments to fetch from crawl db
  freegen           generate new segments to fetch from text files
  fetch             fetch a segment's pages
  parse             parse a segment's pages
  readseg           read / dump segment data
  mergesegs         merge several segments, with optional filtering and slicing
  updatedb          update crawl db from segments after fetching
  invertlinks       create a linkdb from parsed segments
  mergelinkdb       merge linkdb-s, with optional filtering
  index             run the plugin-based indexer on parsed segments and linkdb
  solrindex         run the solr indexer on parsed segments and linkdb
  solrdedup         remove duplicates from solr
  solrclean         remove HTTP 301 and 404 documents from solr
  clean             remove HTTP 301 and 404 documents from indexing backends configured via plugins
  parsechecker      check the parser for a given url
  indexchecker      check the indexing filters for a given url
  domainstats       calculate domain statistics from crawldb
  webgraph          generate a web graph from existing segments
  linkrank          run a link analysis program on the generated web graph
  scoreupdater      updates the crawldb with linkrank scores
  nodedumper        dumps the web graph's node scores
  plugin            load a plugin and run one of its classes main()
  junit             runs the given JUnit test
or
  CLASSNAME         run the class named CLASSNAME
Most commands print help when invoked w/o parameters.

7. Configuration
7.1 Set the HTTP agent name (http.agent.name) in nutch-site.xml. For ordinary sites such as educational ones any value will do, but to crawl a site like www.1688.com the crawler has to impersonate a browser, in which case set the value to a real browser user-agent string, for example: Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.122 Safari/537.36
[hadoop@localhost usr]# vi /usr/nutch/conf/nutch-site.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>
  <property>
    <name>http.agent.name</name>
    <value>Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.122 Safari/537.36</value>
  </property>
</configuration>
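Once http.agent.name is set, a quick way to check that fetching and parsing work is the parsechecker command listed in the usage output above (the URL here is only an example):
[hadoop@localhost usr]$ nutch parsechecker http://www.taobao.com/
If the configuration is correct it prints the parse status, metadata and outlinks extracted from the page.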
7.2 Set the plugin path (plugin.folders) in /usr/nutch/conf/nutch-default.xml
<property>
  <name>plugin.folders</name>
  <value>/usr/nutch/plugins</value>
  <description>Directories where nutch plugins are located.  Each
  element may be a relative or absolute path.  If absolute, it is used
  as is.  If relative, it is searched for on the classpath.</description>
</property>
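Since the binary distribution was moved to /usr/nutch in step 3, the bundled plugins should already be in that directory; a quick sanity check:
[hadoop@localhost usr]$ ls /usr/nutch/plugins
You should see plugin directories such as protocol-http, parse-html and index-basic.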
 
8. Crawl test
Suppose we want to crawl http://www.taobao.com/

8.1 Create the urls directory and seed.txt (the URL entry point)

Create a urls directory to hold the list of seed URLs to be crawled first.

[hadoop@localhost usr]# mkdir urls
[hadoop@localhost usr]# vim urls/seed.txt

Write the following into it:

http://www.taobao.com/
If there are multiple seed URLs, put one URL per line, as in the example below.
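For example, a seed.txt with several entries (the extra hosts below are purely illustrative) would look like:
http://www.taobao.com/
http://www.example.com/
http://www.example.org/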


8.2 Edit the URL filter rules
Edit the URL filter regex so that only content under taobao.com is crawled.

[hadoop@localhost usr]# vim /usr/nutch/conf/regex-urlfilter.txt

Change

# accept anything else
+.

to

# accept anything else
+^http://([a-z0-9]*\.)*taobao.com/
If you do not want to restrict the crawl, this file does not need to be modified.
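To sanity-check the filter rules without starting a crawl, Nutch can run an arbitrary class (see the CLASSNAME entry in the usage output above); assuming the URLFilterChecker class and its -allCombined option are available in this version, URLs piped to it are echoed back with a leading + if accepted or - if rejected:
[hadoop@localhost usr]$ echo "http://www.taobao.com/" | nutch org.apache.nutch.net.URLFilterChecker -allCombined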


8.3 Run the crawl command
[hadoop@localhost usr]# nutch crawl urls -dir taobao -depth 10 -threads 10 -topN 10

Parameter notes:
-dir dirname       directory in which the crawled data is stored
-depth depth       crawl depth, i.e. how many link levels to follow from the seed URLs
-delay delay       delay between requests to different hosts, in seconds
-threads threads   number of fetcher threads to start
-topN N            fetch at most the top N URLs in each round; the default is Integer.MAX_VALUE
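Note that the usage output above marks the one-step crawl command as deprecated in favour of the bin/crawl script shipped with Nutch. A rough equivalent using that script (the argument order assumed here is seed directory, crawl directory, Solr URL and number of rounds; the Solr URL is only a placeholder for the local Solr instance set up in the next part of this series) would be:
[hadoop@localhost nutch]$ bin/crawl urls taobao http://localhost:8983/solr/ 10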
