While setting up this whole environment I ran into many problems. I searched around online but never found a complete walkthrough, so I am recording the process here for other beginners. This post focuses on setting up nutch 1.7 + solr 4.8.1 + Eclipse on CentOS 6.5; there is already plenty of material on installing CentOS, Tomcat, and the JDK, so those steps are not repeated here.
1. Install the JDK on CentOS 6.5 (jdk-7u71-linux-i586.rpm); the detailed steps are omitted here. (Note: Solr 4.8 requires JDK 1.7.)
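A quick sanity check after installing the RPM (a minimal sketch, assuming the package installs with its default settings):
[hadoop@localhost ~]$ rpm -ivh jdk-7u71-linux-i586.rpm
[hadoop@localhost ~]$ java -version
The second command should report a 1.7.0_71 version.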
2. Download the Linux distribution of nutch 1.7
- Download page: http://archive.apache.org/dist/nutch/1.7/
- Files: apache-nutch-1.7-bin.tar.gz and apache-nutch-1.7-src.tar.gz
- Save them to /usr/download/ (create the directory if it does not exist)
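If the machine has network access, the archives can also be fetched directly with wget (a sketch using the download address above):
[hadoop@localhost ~]$ mkdir -p /usr/download
[hadoop@localhost ~]$ cd /usr/download
[hadoop@localhost download]$ wget http://archive.apache.org/dist/nutch/1.7/apache-nutch-1.7-bin.tar.gz
[hadoop@localhost download]$ wget http://archive.apache.org/dist/nutch/1.7/apache-nutch-1.7-src.tar.gz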
3. Extract apache-nutch-1.7-bin.tar.gz
- Command: [hadoop@localhost ~]$ cd /usr/download/
- Command: [hadoop@localhost download]$ tar -zxvf apache-nutch-1.7-bin.tar.gz
- Command: [hadoop@localhost download]$ mv apache-nutch-1.7 /usr/
- Command: [hadoop@localhost download]$ mv /usr/apache-nutch-1.7 /usr/nutch
4. Set the nutch environment variables
[hadoop@localhost usr]$ vi /etc/profile
Add the following lines:
# set nutch environment
export NUTCH_HOME=/usr/nutch
export PATH=$PATH:$NUTCH_HOME/bin
5. Apply the environment variables
[hadoop@localhost usr]$ source /etc/profile
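A quick check that the variables took effect (which should resolve to the script under /usr/nutch/bin):
[hadoop@localhost usr]$ echo $NUTCH_HOME
[hadoop@localhost usr]$ which nutch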
6. Check that nutch is installed correctly
[hadoop@localhost usr]$ nutch
It should print the following usage text:
Usage: nutch COMMAND
where COMMAND is one of:
  crawl             one-step crawler for intranets (DEPRECATED - USE CRAWL SCRIPT INSTEAD)
  readdb            read / dump crawl db
  mergedb           merge crawldb-s, with optional filtering
  readlinkdb        read / dump link db
  inject            inject new urls into the database
  generate          generate new segments to fetch from crawl db
  freegen           generate new segments to fetch from text files
  fetch             fetch a segment's pages
  parse             parse a segment's pages
  readseg           read / dump segment data
  mergesegs         merge several segments, with optional filtering and slicing
  updatedb          update crawl db from segments after fetching
  invertlinks       create a linkdb from parsed segments
  mergelinkdb       merge linkdb-s, with optional filtering
  index             run the plugin-based indexer on parsed segments and linkdb
  solrindex         run the solr indexer on parsed segments and linkdb
  solrdedup         remove duplicates from solr
  solrclean         remove HTTP 301 and 404 documents from solr
  clean             remove HTTP 301 and 404 documents from indexing backends configured via plugins
  parsechecker      check the parser for a given url
  indexchecker      check the indexing filters for a given url
  domainstats       calculate domain statistics from crawldb
  webgraph          generate a web graph from existing segments
  linkrank          run a link analysis program on the generated web graph
  scoreupdater      updates the crawldb with linkrank scores
  nodedumper        dumps the web graph's node scores
  plugin            load a plugin and run one of its classes main()
  junit             runs the given JUnit test
 or
  CLASSNAME         run the class named CLASSNAME
Most commands print help when invoked w/o parameters.
7. Parameter configuration
7.1 Set the HTTP agent name (http.agent.name) in nutch-site.xml. The value can be arbitrary, but an arbitrary value only works for ordinary sites such as education sites; to crawl www.1688.com you need to impersonate a browser, in which case set the value to a real user-agent string:
Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.122 Safari/537.36
[hadoop@localhost usr]# vi /usr/nutch/conf/nutch-site.xml

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
  <property>
    <name>http.agent.name</name>
    <value>Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.122 Safari/537.36</value>
  </property>
</configuration>
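To verify the agent name is actually being used, the parsechecker command from the usage list in step 6 fetches and parses a single page with the configured settings (a quick smoke test; the URL here is just an example):
[hadoop@localhost usr]$ nutch parsechecker http://www.taobao.com/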
7.2 Set the plugin path in nutch-default.xml (note there must be no leading whitespace inside the value element):
<property>
  <name>plugin.folders</name>
  <value>/usr/nutch/plugins</value>
  <description>Directories where nutch plugins are located. Each
  element may be a relative or absolute path. If absolute, it is used
  as is. If relative, it is searched for on the classpath.</description>
</property>
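A quick check that the configured path exists (the plugins directory ships inside the extracted distribution, so after step 3 it should sit at /usr/nutch/plugins):
[hadoop@localhost usr]$ ls /usr/nutch/plugins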
8. Test crawl
Suppose we want to crawl http://www.taobao.com/.
8.1 Create the urls directory and seed.txt (the URL entry point)
Create a urls directory to hold the list of seed URLs to crawl:
[hadoop@localhost usr]# mkdir urls
[hadoop@localhost usr]# vim urls/seed.txt
Write http://www.taobao.com/ into seed.txt. If there are several seed URLs, put one per line.
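For example, a seed.txt with several entries would look like this (one URL per line; the second URL is a hypothetical subdomain used only to illustrate the format):
http://www.taobao.com/
http://shop.taobao.com/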
8.2 Edit the URL filter rules
Change the crawl URL regex so that only content on taobao.com is fetched:
[hadoop@localhost usr]# vim conf/regex-urlfilter.txt
Change the final rule
# accept anything else
+.
to
+^http://([a-z0-9]*\.)*taobao.com/
If you do not want to restrict the crawl, leave this file unchanged.
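For reference, here is what the pattern accepts and rejects (the sample URLs are illustrations only, written as comments in regex-urlfilter.txt syntax):
# +^http://([a-z0-9]*\.)*taobao.com/ accepts:
#   http://www.taobao.com/    (the "www." label matches ([a-z0-9]*\.)*)
#   http://taobao.com/        (zero subdomain labels also match)
# and skips:
#   http://www.1688.com/      (host is not taobao.com)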
8.3 Run the crawl command
[hadoop@localhost usr]# nutch crawl urls -dir taobao -depth 10 -threads 10 -topN 10
Parameter notes:
- -dir dirname: directory in which the crawled data is stored.
- -depth depth: how many link levels deep the crawl goes.
- -delay delay: delay between requests to different hosts, in seconds.
- -threads threads: number of fetcher threads to start.
- -topN number: limits how many top-ranked links are crawled in each iteration; the default is Integer.MAX_VALUE.
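Note that the usage output in step 6 marks nutch crawl as deprecated in favor of the crawl script shipped in the same bin directory. A roughly equivalent invocation would look like the sketch below (in nutch 1.7 the script also expects a Solr URL for indexing; the localhost URL is a placeholder for whatever your solr 4.8.1 instance ends up using):
[hadoop@localhost usr]# /usr/nutch/bin/crawl urls taobao http://localhost:8080/solr 10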