Environment
Nutch website: http://nutch.apache.org/
Linux: CentOS 7.3, 64-bit
JDK 1.8
apache-nutch-2.2.1-src.tar.gz
MySQL
JDK configuration
yum search jdk | grep java
yum install java-1.8.0-openjdk
vi /etc/profile
Add the following:
#set java environment
JAVA_HOME=/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.151-5.b12.el7_4.x86_64
JRE_HOME=$JAVA_HOME/jre
CLASSPATH=.:$JAVA_HOME/lib/dt.jar:$JAVA_HOME/lib/tools.jar:$JRE_HOME/lib
PATH=$PATH:$JAVA_HOME/bin:$JRE_HOME/bin
export JAVA_HOME JRE_HOME CLASSPATH PATH
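The JAVA_HOME path above is hard-coded to one OpenJDK build, so it stops matching after a package update. As a minimal sketch, the path could instead be derived from the java binary itself. The helper name java_home_from_bin is hypothetical, and the sketch assumes the binary lives under .../jre/bin/java or .../bin/java.

```shell
#!/bin/sh
# Hypothetical helper: derive JAVA_HOME from the resolved path of a java
# binary by stripping a trailing /jre/bin/java or /bin/java.
java_home_from_bin() {
    bin_path=$1
    case "$bin_path" in
        */jre/bin/java) printf '%s\n' "${bin_path%/jre/bin/java}" ;;
        */bin/java)     printf '%s\n' "${bin_path%/bin/java}" ;;
        *)              return 1 ;;   # not a recognisable java binary path
    esac
}

# Usage on a live system (readlink -f resolves the alternatives symlinks):
#   JAVA_HOME=$(java_home_from_bin "$(readlink -f "$(command -v java)")")
#   export JAVA_HOME
```

The usage lines are left as comments because they depend on java being installed on the machine.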
Installing Ant
yum install ant
Building Nutch
wget http://archive.apache.org/dist/nutch/2.2.1/apache-nutch-2.2.1-src.tar.gz
tar -xvf apache-nutch-2.2.1-src.tar.gz
cd apache-nutch-2.2.1/
ant
Change the Maven repository in ivysettings.xml to:
http://maven.aliyun.com/nexus/content/groups/public/
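The repository change can also be scripted. The sketch below assumes the settings file still contains the default URL http://repo1.maven.org/maven2/ and that it lives at ivy/ivysettings.xml; both are assumptions worth checking against your copy of the source tree. The helper name swap_repo_url is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: replace one repository URL with another in an Ivy
# settings file. A backup copy is kept next to the original as <file>.bak.
swap_repo_url() {
    file=$1 old=$2 new=$3
    cp "$file" "$file.bak" || return 1
    # '|' is the sed delimiter because URLs contain '/'. Dots in $old act as
    # regex wildcards, which is harmless for this substitution.
    sed "s|$old|$new|g" "$file.bak" > "$file"
}

# Usage (paths and the old URL are assumptions, see above):
#   swap_repo_url ivy/ivysettings.xml \
#       http://repo1.maven.org/maven2/ \
#       http://maven.aliyun.com/nexus/content/groups/public/
```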
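The build generates a runtime folder containing two folders, deploy and local.
deploy depends on Hadoop: HDFS handles storage and MapReduce handles analysis, supplemented by other functions. local has no such dependency.
To configure MySQL support in Nutch, edit ${APACHE_NUTCH_HOME}/ivy/ivy.xml:
-- find the following lines and uncomment them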
<dependency org="mysql" name="mysql-connector-java" rev="5.1.18" conf="*->default"/>
<dependency org="org.apache.gora" name="gora-sql" rev="0.1.1-incubating" conf="*->default" />
Database connection configuration
Edit ${NUTCH_HOME}/conf/gora.properties:
###############################
# MySQL properties
###############################
gora.sqlstore.jdbc.driver=com.mysql.jdbc.Driver
gora.sqlstore.jdbc.url=jdbc:mysql://192.168.58.1:3306/nutch?createDatabaseIfNotExist=true
gora.sqlstore.jdbc.user=root
gora.sqlstore.jdbc.password=123456
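Before running a crawl it can help to confirm that the host, port, and database in the JDBC URL are the ones you expect. The following is a minimal sketch that reads them back out of gora.properties. The helper names get_prop and parse_mysql_url are hypothetical, and the parser only handles URLs of the form jdbc:mysql://host:port/db?params, as used above.

```shell
#!/bin/sh
# Hypothetical helpers for reading conf/gora.properties.

# Print the value of a key=value property (first match only).
get_prop() {
    file=$1 key=$2
    sed -n "s/^$key=//p" "$file" | head -n 1
}

# Print "host port database" for a jdbc:mysql://host:port/db?params URL.
# Prints nothing if the URL does not have that shape.
parse_mysql_url() {
    printf '%s\n' "$1" |
        sed -n 's|^jdbc:mysql://\([^:/]*\):\([0-9]*\)/\([^?]*\).*$|\1 \2 \3|p'
}

# Usage:
#   url=$(get_prop conf/gora.properties gora.sqlstore.jdbc.url)
#   parse_mysql_url "$url"
```

With the host and port in hand, a connection test such as mysql -h 192.168.58.1 -P 3306 -u root -p shows whether the database is reachable from the crawler machine.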
Edit the nutch-site configuration file
vim nutch-site.xml
<property>
<name>http.agent.name</name>
<value>LiuXun Nutch Spider</value>
</property>
<property>
<name>http.accept.language</name>
<value>ja-jp, en-us,en-gb,en;q=0.7,*;q=0.3</value>
<description>Value of the "Accept-Language" request header field.
This allows selecting non-English language as default one to retrieve.
It is a useful setting for search engines built for certain national group.
</description>
</property>
<property>
<name>parser.character.encoding.default</name>
<value>utf-8</value>
<description>The character encoding to fall back to when no other information
is available</description>
</property>
<property>
<name>storage.data.store.class</name>
<value>org.apache.gora.sql.store.SqlStore</value>
<description>The Gora DataStore class for storing and retrieving data.
Currently the following stores are available: ….
</description>
</property>
Additionally, add this property:
<property>
<name>generate.batch.id</name>
<value>*</value>
</property>
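The property blocks above belong inside the <configuration> root element of nutch-site.xml. A quick sanity check before rebuilding might look like the sketch below. The helper name check_site_config is hypothetical, and the check is a plain text match on the two properties this setup depends on, not a real XML validation.

```shell
#!/bin/sh
# Hypothetical helper: text-level sanity check of a nutch-site.xml file.
# Verifies that the <configuration> root is present and that the crawler
# name and Gora store class properties are declared.
check_site_config() {
    file=$1
    grep -q '<configuration>' "$file" || {
        echo "missing <configuration> root in $file" >&2
        return 1
    }
    for name in http.agent.name storage.data.store.class; do
        grep -q "<name>$name</name>" "$file" || {
            echo "missing property: $name" >&2
            return 1
        }
    done
}

# Usage:
#   check_site_config conf/nutch-site.xml && echo "nutch-site.xml looks OK"
```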
Then run on the command line:
ant clean
Then run:
ant runtime
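After the rebuild, one way to confirm that the MySQL dependencies were actually picked up is to look for their jars among the build output. This sketch assumes the jars end up in runtime/local/lib and that their file names start with the Ivy artifact names (mysql-connector-java, gora-sql); treat both as assumptions. The helper name has_jar is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: succeed if a jar whose file name starts with the
# given prefix exists in the given lib directory.
has_jar() {
    lib_dir=$1 prefix=$2
    for jar in "$lib_dir"/"$prefix"*.jar; do
        # When nothing matches, the glob stays literal and -e fails.
        [ -e "$jar" ] && return 0
    done
    return 1
}

# Usage (directory and prefixes are assumptions, see above):
#   for p in mysql-connector-java gora-sql; do
#       has_jar runtime/local/lib "$p" || echo "missing jar: $p"
#   done
```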
Start crawling
mkdir urls
vim urls/url.txt ------ list the sites to crawl, one URL per line
Run the command:
bin/nutch crawl urls -depth 3 -topN 5
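The seed setup can also be done without an editor. This is a minimal sketch; the helper name write_seeds is hypothetical, and the file name url.txt simply follows the one used above.

```shell
#!/bin/sh
# Hypothetical helper: write seed URLs, one per line, to <dir>/url.txt.
# Fails if no URL is given, so an empty seed file is never created.
write_seeds() {
    seed_dir=$1
    shift
    [ $# -gt 0 ] || return 1
    mkdir -p "$seed_dir" || return 1
    printf '%s\n' "$@" > "$seed_dir/url.txt"
}

# Usage:
#   write_seeds urls http://nutch.apache.org/
#   bin/nutch crawl urls -depth 3 -topN 5
```

In the crawl command, -depth is the number of link levels to follow from the seed pages and -topN caps the number of pages fetched at each level, so the values 3 and 5 keep a first test run small.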