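1 Install ant (omitted)
The official Nutch 2.x releases are currently distributed as source only. Prebuilt binaries are no longer provided, so you have to compile it yourself.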
2 Download nutch
2.1 Download URL:
http://www.apache.org/dyn/closer.lua/nutch/2.2.1/apache-nutch-2.2.1-src.tar.gz
tar -zxvf apache-nutch-2.2.1-src.tar.gz
2.2 Download the sonar jar and put it in the apache-nutch-2.2.1 directory
Edit build.xml so that it references the jar added above.
2.3 The nutch build downloads jars from the Maven repository, so change the repository address
vi apache-nutch-2.2.1/ivy/ivysettings.xml
Change value="http://repo1.maven.org/maven2/" to value="http://repo2.maven.org/maven2/"
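The same edit can be scripted instead of done by hand in vi. The following is a minimal sketch, assuming the settings file lives at apache-nutch-2.2.1/ivy/ivysettings.xml as in this guide. The function name switch_ivy_repo is made up for illustration, and a .bak copy is kept next to the original.

```shell
# Sketch: replace the repo1 Maven URL with repo2 in an Ivy settings file.
# switch_ivy_repo is a hypothetical helper name, not part of nutch or ivy.
switch_ivy_repo() {
    settings="$1"
    if [ ! -f "$settings" ]; then
        echo "settings file not found: $settings" >&2
        return 1
    fi
    # Keep a backup, then rewrite the file from the backup.
    cp "$settings" "$settings.bak"
    sed 's#http://repo1\.maven\.org/maven2/#http://repo2.maven.org/maven2/#g' \
        "$settings.bak" > "$settings"
}

# Only touch the file if it exists at the path assumed in this guide.
if [ -f apache-nutch-2.2.1/ivy/ivysettings.xml ]; then
    switch_ivy_repo apache-nutch-2.2.1/ivy/ivysettings.xml
fi
```

If the result is not what you expected, restore the original from ivysettings.xml.bak.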
3 Use MySQL as the nutch storage backend
Edit the apache-nutch-2.2.1/ivy/ivy.xml file and uncomment
Change
to
4 Database connection configuration
Edit the apache-nutch-2.2.1/conf/gora.properties file: comment out the default database connection settings and add the following:
###############################
# Default SqlStore properties #
###############################
gora.sqlstore.jdbc.driver=com.mysql.jdbc.Driver
gora.sqlstore.jdbc.url=jdbc:mysql://10.202.13.175:3306/nutch-test
gora.sqlstore.jdbc.user=root
gora.sqlstore.jdbc.password=sf123456
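5 Manually create the nutch database and the webpage table (they can also be created automatically). The database name must match the one in the JDBC URL from step 4.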
CREATE TABLE `webpage` (
`id` varchar(767) NOT NULL,
`headers` blob,
`text` longtext DEFAULT NULL,
`status` int(11) DEFAULT NULL,
`markers` blob,
`parseStatus` blob,
`modifiedTime` bigint(20) DEFAULT NULL,
`prevModifiedTime` bigint(20) DEFAULT NULL,
`score` float DEFAULT NULL,
`typ` varchar(32) CHARACTER SET latin1 DEFAULT NULL,
`batchId` varchar(32) CHARACTER SET latin1 DEFAULT NULL,
`baseUrl` varchar(767) DEFAULT NULL,
`content` longblob,
`title` varchar(2048) DEFAULT NULL,
`reprUrl` varchar(767) DEFAULT NULL,
`fetchInterval` int(11) DEFAULT NULL,
`prevFetchTime` bigint(20) DEFAULT NULL,
`inlinks` mediumblob,
`prevSignature` blob,
`outlinks` mediumblob,
`fetchTime` bigint(20) DEFAULT NULL,
`retriesSinceFetch` int(11) DEFAULT NULL,
`protocolStatus` blob,
`signature` blob,
`metadata` blob,
PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
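The table above has to live in the database that the JDBC URL points at. As a rough sketch, the script below generates the matching CREATE DATABASE statement. It assumes the name nutch-test from the JDBC URL in step 4; the file name create_db.sql and the variable names are made up for illustration. Because the name contains a hyphen, it must be quoted with backticks in MySQL.

```shell
# Sketch: write the SQL that creates the database named in the JDBC URL.
DB_NAME="${DB_NAME:-nutch-test}"       # assumption: taken from the JDBC URL in step 4
SQL_FILE="${SQL_FILE:-create_db.sql}"  # hypothetical output file name

# The backticks are escaped so the shell writes them literally.
cat > "$SQL_FILE" <<EOF
CREATE DATABASE IF NOT EXISTS \`$DB_NAME\` DEFAULT CHARACTER SET utf8;
EOF

cat "$SQL_FILE"

# To apply it (needs the mysql client and valid credentials):
#   mysql -h 10.202.13.175 -u root -p < create_db.sql
```

The CREATE TABLE statement can then be run the same way, with the database selected.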
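6 Edit the ${NUTCH_HOME}/conf/nutch-site.xml configuration file
Add the following inside <configuration>:
<property>
<name>http.agent.name</name>
<value>YourNutchSpider</value>
</property>
<property>
<name>http.accept.language</name>
<value>ja-jp, en-us,en-gb,en;q=0.7,*;q=0.3</value>
<description>Value of the Accept-Language request header field.
This allows selecting non-English language as default one to retrieve.
It is a useful setting for search engines build for certain national group.
</description>
</property>
<property>
<name>storage.data.store.class</name>
<value>org.apache.gora.sql.store.SqlStore</value>
<description>The Gora DataStore class for storing and retrieving data.
Currently the following stores are available:.
</description>
</property>
<property>
<name>parser.character.encoding.default</name>
<value>utf-8</value>
<description>The character encoding to fall back to when no other information
is available
</description>
</property>
<property>
<name>generate.batch.id</name>
<value>*</value>
</property>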
7 Build with ant
Change to the apache-nutch-2.2.1 top-level directory and run the ant command
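8 Add jars manually (optional)
A few jars may fail to download through ivy. They can be downloaded directly, for example from http://repo2.maven.org/maven2/org/bouncycastle/bcprov-jdk15on/1.52/bcprov-jdk15on-1.52.jar
Then copy the jar into the ivy repository and rebuild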
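The copy step might look like the sketch below. It assumes that the "ivy repository" means Ivy's local cache in its default location (~/.ivy2/cache) and default layout (organisation/module/jars). If the build uses a different cache directory or pattern, the destination will differ. The helper name install_into_ivy_cache and the IVY_CACHE variable are made up for illustration.

```shell
# Sketch: copy a manually downloaded jar into the local Ivy cache.
# Assumes Ivy's default cache layout: <cache>/<organisation>/<module>/jars/
install_into_ivy_cache() {
    jar="$1"      # path to the downloaded jar
    org="$2"      # organisation, e.g. org.bouncycastle
    module="$3"   # module name, e.g. bcprov-jdk15on
    dest="${IVY_CACHE:-$HOME/.ivy2/cache}/$org/$module/jars"
    if [ ! -f "$jar" ]; then
        echo "jar not found: $jar" >&2
        return 1
    fi
    mkdir -p "$dest" && cp "$jar" "$dest/"
}

# Example for the jar mentioned above, if it is in the current directory.
if [ -f bcprov-jdk15on-1.52.jar ]; then
    install_into_ivy_cache bcprov-jdk15on-1.52.jar org.bouncycastle bcprov-jdk15on
fi
```

After copying, run ant again from the apache-nutch-2.2.1 directory.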
9 Crawl configuration
9.1 Set the sites to crawl
cd apache-nutch-2.2.1/runtime/local
mkdir -p urls
echo 'http://www.sina.com' > urls/seed.txt
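Before starting a crawl it can help to confirm that the seed file contains what you expect. The following is a small sketch of such a check, not part of nutch itself. The function name check_seeds is made up, and the rule that every entry should start with http:// or https:// is an assumption of this check. Blank lines and lines starting with # are skipped by the check.

```shell
# Sketch: sanity-check a seed file before crawling.
check_seeds() {
    seed_file="$1"
    if [ ! -s "$seed_file" ]; then
        echo "seed file missing or empty: $seed_file" >&2
        return 1
    fi
    # Collect lines that are neither blank, nor comments, nor http(s) URLs.
    bad=$(grep -v -E '^[[:space:]]*(#|$)' "$seed_file" | grep -v -E '^https?://' || true)
    if [ -n "$bad" ]; then
        echo "lines that do not look like URLs:" >&2
        echo "$bad" >&2
        return 1
    fi
    return 0
}

# Check the seed file created above, if present.
if [ -f urls/seed.txt ]; then
    check_seeds urls/seed.txt
fi
```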
9.2 Run the crawl
cd apache-nutch-2.2.1/runtime/local
bin/nutch crawl urls -depth 3 -topN 5