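I. MySQL Configuration
1. my.ini configuration
Find the [mysqld] section and add skip-grant-tables and character-set-server=utf8 below it. Find the [mysql] and [client] sections and add default-character-set=utf8 below each. Because skip-grant-tables turns off password checking for all connections, use it only on a trusted development machine.
Then restart the MySQL service.
(Note: settings that are already present do not need to be added again.)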
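Before editing, it can help to check which of these settings already exist. The following is a minimal sketch, assuming a POSIX shell with awk is available (for example Git Bash or Cygwin on Windows). The helper name has_setting and the my.ini path in the example are hypothetical and not part of MySQL.

```shell
# has_setting FILE SECTION KEY
# Succeeds (exit 0) if a line starting with KEY appears under [SECTION] in FILE.
# Hypothetical helper for illustration; it only reads the file.
has_setting() {
  awk -v section="[$2]" -v key="$3" '
    { sub(/\r$/, "") }                      # tolerate Windows line endings
    /^\[/ { in_section = ($0 == section) }  # track the current [section]
    in_section && index($0, key) == 1 { found = 1 }
    END { exit !found }
  ' "$1"
}

# Example (path assumed): report whether [mysqld] still needs the charset line.
#   has_setting my.ini mysqld character-set-server || echo "add character-set-server=utf8"
```

The check is read-only on purpose, so the actual edit to my.ini stays a manual step.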
2. Create the database and table
CREATE TABLE `webpage` (
`id` varchar(767) CHARACTER SET latin1 NOT NULL,
`headers` blob,
`text` mediumtext,
`status` int(11) DEFAULT NULL,
`markers` blob,
`parseStatus` blob,
`modifiedTime` bigint(20) DEFAULT NULL,
`score` float DEFAULT NULL,
`typ` varchar(32) CHARACTER SET latin1 DEFAULT NULL,
`baseUrl` varchar(512) CHARACTER SET latin1 DEFAULT NULL,
`content` mediumblob,
`title` varchar(2048) DEFAULT NULL,
`reprUrl` varchar(512) CHARACTER SET latin1 DEFAULT NULL,
`fetchInterval` int(11) DEFAULT NULL,
`prevFetchTime` bigint(20) DEFAULT NULL,
`inlinks` mediumblob,
`prevSignature` blob,
`outlinks` mediumblob,
`fetchTime` bigint(20) DEFAULT NULL,
`retriesSinceFetch` int(11) DEFAULT NULL,
`protocolStatus` blob,
`signature` blob,
`metadata` blob,
`batchId` varchar(500) DEFAULT NULL,
PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
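The CREATE TABLE statement above needs a database to run in. The following is a minimal sketch of creating one from the shell, assuming the database is named nutch (the name used in the JDBC URL later in this guide) and that the mysql command-line client is on the PATH. The helper create_db_sql is hypothetical.

```shell
# Print a CREATE DATABASE statement for the given name (hypothetical helper).
# utf8 matches the character-set-server value set in my.ini.
create_db_sql() {
  printf 'CREATE DATABASE IF NOT EXISTS `%s` DEFAULT CHARACTER SET utf8;\n' "$1"
}

# Review the statement, then pipe it to the client, for example:
#   create_db_sql nutch | mysql -u root -p
create_db_sql nutch
```

Printing the statement first makes it easy to inspect before it reaches the server.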
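II. Installing, Configuring, and Using Nutch
1. Download Nutch 2.2.x from http://apache.fayea.com/apache-mirror/nutch/ and extract it to a local installation directory, referred to below as ${NUTCH_HOME}.
2. Enable MySQL support in Nutch by editing ${NUTCH_HOME}/ivy/ivy.xml as follows:
1) Find the following line and uncomment it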
<dependency org="mysql" name="mysql-connector-java" rev="5.1.18" conf="*->default"/>
2) Change the following line
Default:
<dependency org="org.apache.gora" name="gora-core" rev="0.3" conf="*->default"/>
After the change:
<dependency org="org.apache.gora" name="gora-core" rev="0.2.1" conf="*->default"/>
3) Uncomment the following line
<dependency org="org.apache.gora" name="gora-sql" rev="0.1.1-incubating" conf="*->default" />
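Note: if steps 2) and 3) above are skipped, Nutch fails with the following exception:
Exception in thread "main" java.lang.ClassNotFoundException: org.apache.gora.sql.store.SqlStore
3. Database connection configuration
Edit ${NUTCH_HOME}/conf/gora.properties, comment out the default database connection settings, and add the following: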
###############################
# MySQL properties
################################
gora.sqlstore.jdbc.driver=com.mysql.jdbc.Driver
gora.sqlstore.jdbc.url=jdbc:mysql://192.168.58.1:3306/nutch?createDatabaseIfNotExist=true
gora.sqlstore.jdbc.user=root
gora.sqlstore.jdbc.password=liuxun123
Replace the address, user name, and password with those of the database you want to connect to.
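To confirm the connection details before running Nutch, the JDBC URL can be split into the host, port, and database name that the mysql client expects. This is a sketch under the assumption that the URL has the jdbc:mysql://host:port/database form shown above, with an explicit port. The helper jdbc_url_parts is hypothetical.

```shell
# jdbc_url_parts URL: print "host port database" for a jdbc:mysql:// URL.
# Hypothetical helper; assumes an explicit port is present in the URL.
jdbc_url_parts() (
  rest=${1#jdbc:mysql://}   # drop the scheme prefix
  rest=${rest%%\?*}         # drop any ?key=value options
  hostport=${rest%%/*}      # part before the first slash
  db=${rest#*/}             # part after the first slash
  echo "${hostport%%:*} ${hostport##*:} $db"
)

jdbc_url_parts 'jdbc:mysql://192.168.58.1:3306/nutch?createDatabaseIfNotExist=true'
# prints: 192.168.58.1 3306 nutch
```

The three values map to the -h and -P options and the database argument of the mysql client, so a manual login with the same user name and password exercises the same connection that Nutch will use.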
4. Edit the nutch-site configuration file
Add the following inside the configuration element of ${NUTCH_HOME}/conf/nutch-site.xml:
<property>
<name>http.agent.name</name>
<value>LiuXun Nutch Spider</value>
</property>
<property>
<name>http.accept.language</name>
<value>ja-jp, en-us,en-gb,en;q=0.7,*;q=0.3</value>
<description>Value of the "Accept-Language" request header field.
This allows selecting a non-English language as the default one to retrieve.
It is a useful setting for search engines built for a particular national group.
</description>
</property>
<property>
<name>parser.character.encoding.default</name>
<value>utf-8</value>
<description>The character encoding to fall back to when no other information
is available</description>
</property>
<property>
<name>storage.data.store.class</name>
<value>org.apache.gora.sql.store.SqlStore</value>
<description>The Gora DataStore class for storing and retrieving data.
Currently the following stores are available: ….
</description>
</property>
<property>
<name>generate.batch.id</name>
<value>*</value>
</property>
5. Build Nutch 2.2.x
1) Install Ant first
2) Change to the ${NUTCH_HOME} directory and run the ant command
3) After a successful build, a runtime directory appears under ${NUTCH_HOME}
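The build steps above can be sketched as a small shell check. It assumes NUTCH_HOME is set to the extracted source directory and that ant is on the PATH. The test for runtime/local/bin/nutch relies on the layout used by the crawl step in this guide, and runtime_ready is a hypothetical helper.

```shell
# runtime_ready DIR: succeeds if DIR contains a built local runtime,
# i.e. the bin/nutch launcher under runtime/local (hypothetical helper).
runtime_ready() {
  [ -f "$1/runtime/local/bin/nutch" ]
}

# Typical use (not run here):
#   cd "$NUTCH_HOME" && ant && runtime_ready "$NUTCH_HOME" && echo "build OK"
```

Checking for the launcher rather than only the runtime directory catches a build that stopped partway through.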
6. Configuring and running the crawl
1) Change to the ${NUTCH_HOME}/runtime/local directory
2) Set the site to crawl
Run the following commands:
mkdir -p urls
echo 'http://www.oschina.net/' > urls/seed.txt
3) Run the crawl
bin/nutch crawl urls -depth 3 -topN 5
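The nutch command was introduced in an earlier chapter.
Once the crawl finishes, the fetched content can be viewed in MySQL, as shown in the figure below: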
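A quick way to inspect the result from the shell is to query a few columns of the webpage table. This is a sketch, assuming the nutch database and the webpage table created earlier, and the mysql client on the PATH. The helper webpage_query is hypothetical.

```shell
# webpage_query N: print a query listing up to N crawled rows (hypothetical helper).
# The column names come from the webpage table definition in this guide.
webpage_query() {
  printf 'SELECT id, status, title, fetchTime FROM webpage LIMIT %d;\n' "$1"
}

# Example: webpage_query 10 | mysql -u root -p nutch
webpage_query 10
```

Only short columns are selected, since content and text can hold entire pages and would flood the terminal.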