1. Install MySQL, create a database named nutch, and run the following CREATE TABLE statement:
CREATE TABLE `webpage` (
`id` varchar(250) NOT NULL,
`headers` blob,
`text` longtext,
`status` int(11) DEFAULT NULL,
`markers` blob,
`parseStatus` blob,
`modifiedTime` bigint(20) DEFAULT NULL,
`prevModifiedTime` bigint(20) DEFAULT NULL,
`score` float DEFAULT NULL,
`typ` varchar(32) CHARACTER SET latin1 DEFAULT NULL,
`batchId` varchar(32) CHARACTER SET latin1 DEFAULT NULL,
`baseUrl` varchar(250) DEFAULT NULL,
`content` longblob,
`title` varchar(250) DEFAULT NULL,
`reprUrl` varchar(250) DEFAULT NULL,
`fetchInterval` int(11) DEFAULT NULL,
`prevFetchTime` bigint(20) DEFAULT NULL,
`inlinks` mediumblob,
`prevSignature` blob,
`outlinks` mediumblob,
`fetchTime` bigint(20) DEFAULT NULL,
`retriesSinceFetch` int(11) DEFAULT NULL,
`protocolStatus` blob,
`signature` blob,
`metadata` blob,
PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
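One way to create the database and run the statement above from the shell is sketched below. It assumes the mysql command-line client is on the PATH, that the account is root, and that the CREATE TABLE statement has been saved to a file; the function name load_webpage_schema and the MYSQL_CMD variable are made up for this example and are not part of MySQL or Nutch.

```shell
# MYSQL_CMD is an assumed hook for the client invocation; change the user,
# host or port to match your own MySQL account.
: "${MYSQL_CMD:=mysql -u root -p}"

# load_webpage_schema FILE
# Creates the nutch database if it does not exist yet, switches to it, and
# then feeds the CREATE TABLE statement stored in FILE to the client.
load_webpage_schema() {
  schema_file="$1"
  {
    echo 'CREATE DATABASE IF NOT EXISTS nutch DEFAULT CHARACTER SET utf8;'
    echo 'USE nutch;'
    cat "$schema_file"
  } | $MYSQL_CMD
}
```

With the statement saved as webpage.sql (a hypothetical file name), the call would be: load_webpage_schema webpage.sql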
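2. Download Cygwin from its official website and install it.
3. Download Nutch 2.2.x from the official site, http://www.apache.org/dyn/closer.cgi/nutch/. Note that versions later than 2.2 do not support MySQL.
Unpack the archive into your home directory and change the following configuration:
1) Edit nutch 2.2.x/ivy/ivy.xml as follows: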
- Uncomment the following line:
<dependency org="mysql" name="mysql-connector-java" rev="5.1.18" conf="*->default"/>
- Change the following line from the default
<dependency org="org.apache.gora" name="gora-core" rev="0.3" conf="*->default"/>
to
<dependency org="org.apache.gora" name="gora-core" rev="0.2.1" conf="*->default"/>
- Uncomment the following line:
<dependency org="org.apache.gora" name="gora-sql" rev="0.1.1-incubating" conf="*->default"/>
Note: if the second and third items above are left at their defaults, the crawl fails with the following error:
Exception in thread "main" java.lang.ClassNotFoundException: org.apache.gora.sql.store.SqlStore
2) Database connection settings
Edit ${NUTCH_HOME}/conf/gora.properties: comment out the default database connection settings and add the following:
###############################
# Default MySQL properties    #
###############################
gora.sqlstore.jdbc.driver=com.mysql.jdbc.Driver
gora.sqlstore.jdbc.url=jdbc:mysql://localhost:3306/nutch?createDatabaseIfNotExist=true
# Replace xxxx with your MySQL user name and password.
gora.sqlstore.jdbc.user=xxxx
gora.sqlstore.jdbc.password=xxxx
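A quick sanity check on gora.properties can catch a key that is still commented out before the crawl is started. The sketch below is only an illustration: check_gora_props is a made-up helper, and it checks nothing beyond the presence of the four gora.sqlstore.jdbc.* keys listed above.

```shell
# check_gora_props FILE
# Prints one line for every required key that has no active (uncommented)
# entry in FILE, and returns non-zero if at least one key is missing.
check_gora_props() {
  props_file="$1"
  missing=0
  for key in driver url user password; do
    # A line starting with '#' does not match, so commented-out keys count as missing.
    if ! grep -q "^[[:space:]]*gora\.sqlstore\.jdbc\.${key}[[:space:]]*=" "$props_file"; then
      echo "missing: gora.sqlstore.jdbc.${key}"
      missing=1
    fi
  done
  return "$missing"
}
```

It would be called as: check_gora_props "${NUTCH_HOME}/conf/gora.properties"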
3) Edit the nutch-site configuration file
My approach was to save a copy of nutch-default.xml as nutch-site.xml and then edit nutch-site.xml as follows:
- Set the value of http.agent.name:
<property>
  <name>http.agent.name</name>
  <value>YourNutchSpider</value>
</property>
- Append the following to the end of the file:
<property>
  <name>http.accept.language</name>
  <value>ja-jp, en-us,en-gb,en;q=0.7,*;q=0.3</value>
  <description>Value of the Accept-Language request header field.
  This allows selecting non-English language as default one to retrieve.
  It is a useful setting for search engines build for certain national group.
  </description>
</property>
<property>
  <name>storage.data.store.class</name>
  <value>org.apache.gora.sql.store.SqlStore</value>
  <description>The Gora DataStore class for storing and retrieving data.
  Currently the following stores are available:.
  </description>
</property>
<property>
  <name>parser.character.encoding.default</name>
  <value>utf-8</value>
  <description>The character encoding to fall back to when no other information
  is available</description>
</property>
- In addition, add the following:
<property>
  <name>generate.batch.id</name>
  <value>*</value>
</property>
<property name="repo.maven.org" value="(central repository URL)" override="false"/>
4. Download Ant from its official website and install it. Remember to set up the Ant environment variables.
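5. Build Nutch with Ant. In the Cygwin command window, run: /home/nutch/$ ant. When the window reports BUILD SUCCESSFUL, the build has succeeded. One caveat: Hadoop has a file-permission problem on Windows, so the FileUtil class inside hadoop-core-1.2.0.jar has to be modified:
Download the hadoop-1.2.0 package and edit hadoop-core-1.2.0.src\org\apache\hadoop\fs\FileUtil.java.
Comment out the following statement, leaving the closing brace of the enclosing method in place:
if (!rv)
    throw new IOException(new StringBuilder().append("Failed to set permissions of path: ").append(p).append(" to ").append(String.format("%04o", new Object[] { Short.valueOf(permission.toShort()) })).toString());
Then compile the file and replace the FileUtil class inside hadoop-core-1.2.0.jar with the resulting FileUtil.class.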
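The compile-and-replace step could look roughly like this from the Cygwin shell. This is a sketch under assumptions: the JDK tools javac and jar are on the PATH, FileUtil.java has already been edited, and the function name replace_fileutil and its argument order are invented for illustration. The classpath needed to compile FileUtil.java depends on the Hadoop download, so it is left as a parameter rather than guessed.

```shell
# replace_fileutil SRC_ROOT CLASS_PATH OUT_DIR TARGET_JAR
# Compiles the edited FileUtil.java and writes the resulting FileUtil.class
# back into TARGET_JAR, overwriting the entry with the same path.
replace_fileutil() {
  src_root="$1"     # directory that contains org/apache/hadoop/fs/FileUtil.java
  class_path="$2"   # jars the class compiles against (assumed: hadoop-core plus its libraries)
  out_dir="$3"      # scratch directory for the compiled class
  target_jar="$4"   # the hadoop-core-1.2.0.jar to update in place
  class_rel="org/apache/hadoop/fs/FileUtil"

  # Only FileUtil.class is put back, as the step describes; any nested
  # classes compiled alongside it are left as they already are in the jar.
  mkdir -p "$out_dir" &&
  javac -cp "$class_path" -d "$out_dir" "$src_root/$class_rel.java" &&
  jar uf "$target_jar" -C "$out_dir" "$class_rel.class"
}
```

It is worth keeping a copy of the original jar before running this, since jar uf modifies the archive in place.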
6. Run Nutch to crawl data:
1) Set the site to crawl
cd ${NUTCH_HOME}/runtime/local
mkdir -p urls
echo 'http://www.tianya.cn' > urls/seed.txt
2) Run the crawl
bin/nutch crawl urls -depth 3 -topN 5
3) When the crawl finishes, the fetched content is visible in the webpage table in MySQL.
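To look at the result without opening a MySQL GUI, a few columns of the webpage table can be listed from the shell. The sketch below assumes the mysql command-line client is on the PATH and a root account; show_crawled_pages and MYSQL_CMD are names made up for this example, and the column selection is just one reasonable choice from the table definition in step 1.

```shell
# MYSQL_CMD is an assumed hook for the client invocation; the trailing
# "nutch" selects the database created in step 1.
: "${MYSQL_CMD:=mysql -u root -p nutch}"

# show_crawled_pages [LIMIT]
# Lists the key, fetch status and title of up to LIMIT rows (default 10).
show_crawled_pages() {
  row_limit="${1:-10}"
  echo "SELECT id, status, title FROM webpage LIMIT ${row_limit};" | $MYSQL_CMD
}
```

Calling show_crawled_pages 5 would list at most five rows; an empty result suggests the crawl stored nothing.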