Environment: Linux CentOS 6.5, Nutch 2.2.1 source package, MySQL 5.5, Elasticsearch 1.1.1, JDK 1.7
1. Download the source package from http://mirror.bjtu.edu.cn/apache/nutch/2.2.1/ and unpack it.
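2. Switch the data store to MySQL
Edit ivy/ivy.xml under the Nutch root directory. The MySQL storage dependencies are commented out by default; uncomment them (the relevant entries sit at roughly lines 104 and 107-112 of that file).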
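3. Set the database URL, user name and password in conf/gora.properties under the Nutch root. Comment out the original HSQLDB settings and add the MySQL ones: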
#gora.sqlstore.jdbc.driver=org.hsqldb.jdbc.JDBCDriver
#gora.sqlstore.jdbc.url=jdbc:hsqldb:hsql://localhost/nutchtest
#gora.sqlstore.jdbc.user=sa
#gora.sqlstore.jdbc.password=
###############################
# MySQL properties            #
###############################
gora.sqlstore.jdbc.driver=com.mysql.jdbc.Driver
gora.sqlstore.jdbc.url=jdbc:mysql://ip:3306/nutch?useUnicode=true&characterEncoding=utf8&autoReconnect=true&zeroDateTimeBehavior=convertToNull
gora.sqlstore.jdbc.user=user
gora.sqlstore.jdbc.password=pwd
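A stray space inside the JDBC URL (for example right after the "?") is easy to miss when pasting and will break the connection string. The following is a minimal sketch of a check, not part of Nutch: the function name check_jdbc_url is hypothetical, and it only assumes that gora.properties holds one uncommented gora.sqlstore.jdbc.url line.

```shell
#!/bin/sh
# Hypothetical helper: print the active JDBC URL from a gora.properties file
# and fail if the URL is missing or contains whitespace.
check_jdbc_url() {
    props="$1"
    # First uncommented gora.sqlstore.jdbc.url line; keep everything after the first '='.
    url=$(grep '^gora\.sqlstore\.jdbc\.url=' "$props" | head -n 1 | cut -d= -f2-)
    if [ -z "$url" ]; then
        echo "no active gora.sqlstore.jdbc.url entry in $props" >&2
        return 2
    fi
    # Any space or tab inside the URL is treated as an error.
    if printf '%s' "$url" | grep -q '[[:space:]]'; then
        echo "JDBC URL contains whitespace: $url" >&2
        return 1
    fi
    echo "$url"
}
```

Usage, from the Nutch root: check_jdbc_url conf/gora.properties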
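4. Edit conf/nutch-site.xml and add the following properties inside the <configuration> element:
<property>
<name>http.agent.name</name>
<value>My Spider</value>
</property>
<property>
<name>http.accept.language</name>
<value>ja-jp,zh-cn,en-us,en-gb,en;q=0.7,*;q=0.3</value>
</property>
<property>
<name>parser.character.encoding.default</name>
<value>utf-8</value>
<description>The character encoding to fall back to when no other information
is available</description>
</property>
<property>
<name>storage.data.store.class</name>
<value>org.apache.gora.sql.store.SqlStore</value>
</property>
<property>
<name>plugin.includes</name>
<value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|urlnormalizer-(pass|regex|basic)|scoring-opic</value>
</property>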
5. Build the source with ant
Run ant in the Nutch root directory. The end of the output looks like this:
job:
[jar] Building jar: /home/hadoop/nutch221/build/apache-nutch-2.2.1.job
runtime:
[mkdir] Created dir: /home/hadoop/nutch221/runtime
[mkdir] Created dir: /home/hadoop/nutch221/runtime/local
[mkdir] Created dir: /home/hadoop/nutch221/runtime/deploy
[copy] Copying 1 file to /home/hadoop/nutch221/runtime/deploy
[copy] Copying 2 files to /home/hadoop/nutch221/runtime/deploy/bin
[copy] Copying 1 file to /home/hadoop/nutch221/runtime/local/lib
[copy] Copying 1 file to /home/hadoop/nutch221/runtime/local/lib/native
[copy] Copying 26 files to /home/hadoop/nutch221/runtime/local/conf
[copy] Copying 2 files to /home/hadoop/nutch221/runtime/local/bin
[copy] Copying 100 files to /home/hadoop/nutch221/runtime/local/lib
[copy] Copying 106 files to /home/hadoop/nutch221/runtime/local/plugins
[copy] Copied 2 empty directories to 2 empty directories under /home/hadoop/nutch221/runtime/local/test
BUILD SUCCESSFUL
Total time: 41 seconds
The build succeeded.
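As a quick sanity check after the build, the paths named in the ant output can be tested for existence. This is only a sketch: check_nutch_build is a hypothetical helper, and the list of paths is taken from the log above rather than from any official checklist.

```shell
#!/bin/sh
# Hypothetical helper: confirm that the artifacts mentioned in the ant log exist.
# $1 is the Nutch root directory (e.g. /home/hadoop/nutch221).
check_nutch_build() {
    root="$1"
    missing=0
    for path in \
        build/apache-nutch-2.2.1.job \
        runtime/local/bin \
        runtime/local/conf \
        runtime/local/lib \
        runtime/local/plugins \
        runtime/deploy/bin
    do
        if [ ! -e "$root/$path" ]; then
            echo "missing: $root/$path" >&2
            missing=1
        fi
    done
    # Returns 0 when everything is present, 1 otherwise.
    return "$missing"
}
```

Usage: check_nutch_build /home/hadoop/nutch221 && echo "build output looks complete"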
6. Create the database and the webpage table
CREATE DATABASE nutch DEFAULT CHARACTER SET utf8 DEFAULT COLLATE utf8_general_ci;
USE nutch;
CREATE TABLE `webpage` (
`id` varchar(767) CHARACTER SET latin1 NOT NULL,
`headers` blob,
`text` mediumtext DEFAULT NULL,
`status` int(11) DEFAULT NULL,
`markers` blob,
`parseStatus` blob,
`modifiedTime` bigint(20) DEFAULT NULL,
`score` float DEFAULT NULL,
`typ` varchar(32) CHARACTER SET latin1 DEFAULT NULL,
`baseUrl` varchar(512) CHARACTER SET latin1 DEFAULT NULL,
`content` mediumblob,
`title` varchar(2048) DEFAULT NULL,
`reprUrl` varchar(512) CHARACTER SET latin1 DEFAULT NULL,
`fetchInterval` int(11) DEFAULT NULL,
`prevFetchTime` bigint(20) DEFAULT NULL,
`inlinks` mediumblob,
`prevSignature` blob,
`outlinks` mediumblob,
`fetchTime` bigint(20) DEFAULT NULL,
`retriesSinceFetch` int(11) DEFAULT NULL,
`protocolStatus` blob,
`signature` blob,
`metadata` blob,
PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
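One way to apply these statements is to save them to a file and feed it to the mysql command-line client. The sketch below assumes a file name of your choosing (for example nutch_schema.sql), that the file selects the nutch database before the CREATE TABLE statement (for example with a USE nutch; line), and that the account is allowed to create databases. load_nutch_schema is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: load the schema file, then list the columns of `webpage`.
# $1 = path to the SQL file, $2 = MySQL user name.
load_nutch_schema() {
    sql_file="$1"
    db_user="$2"
    # -p makes the client prompt for the password instead of taking it on the command line.
    mysql -u "$db_user" -p < "$sql_file" || return 1
    # Show the resulting table definition as a quick check.
    mysql -u "$db_user" -p -e 'SHOW COLUMNS FROM webpage' nutch
}
```

Usage: load_nutch_schema nutch_schema.sql user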
7. Run the crawl:
Once it finishes, the crawled content can be seen in MySQL.
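For reference, a crawl needs a directory of seed URLs and then the crawl command that appears in the console output later in this post. The sketch below assumes you work from runtime/local/bin with the seed directory at ../urls; the helper names and the seed.txt file name are assumptions.

```shell
#!/bin/sh
# Hypothetical helper: write the given URLs, one per line, into <seed_dir>/seed.txt.
prepare_seeds() {
    seed_dir="$1"
    shift
    mkdir -p "$seed_dir" || return 1
    printf '%s\n' "$@" > "$seed_dir/seed.txt"
}

# Hypothetical wrapper around the invocation used in this post.
# Must be run from runtime/local/bin so that ./nutch resolves.
run_crawl() {
    ./nutch crawl "$1" -depth 3 -topN 5
}
```

Usage: prepare_seeds ../urls "http://example.com/" followed by run_crawl ../urls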
8. Run the indexing step:
# Official fix:
#http://mail-archives.apache.org/mod_mbox/nutch-user/201307.mbox/%3CCAErFeLSwoZ2UhxMA1iYi7H-L52Ojo-j9KoWT7xDittBzvB0F0A@mail.gmail.com%3E
######################
Update, 2014-11-03
Fix for the problem above: recompiling resolved it.
A new problem then appeared:
./nutch crawl ../urls -depth 3
InjectorJob: Using class org.apache.gora.sql.store.SqlStore as the Gora storage class.
Exception in thread "main" java.lang.RuntimeException: job failed: name=inject ../urls, jobid=job_local713211278_0001
at org.apache.nutch.util.NutchJob.waitForCompletion(NutchJob.java:54)
at org.apache.nutch.crawl.InjectorJob.run(InjectorJob.java:233)
at org.apache.nutch.crawl.Crawler.runTool(Crawler.java:68)
./nutch crawl ../urls -depth 3 -topN 5
InjectorJob: Using class org.apache.gora.sql.store.SqlStore as the Gora storage class.
Exception in thread "main" java.lang.RuntimeException: job failed: name=inject ../urls, jobid=job_local1302478362_0001
at org.apache.nutch.util.NutchJob.waitForCompletion(NutchJob.java:54)
at org.apache.nutch.crawl.InjectorJob.run(InjectorJob.java:233)
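The "job failed" exception above is only the wrapper; in local mode the underlying cause is normally written to logs/hadoop.log under runtime/local. A minimal sketch for pulling the relevant lines out of that log follows. show_job_errors is a hypothetical helper, and the -A option assumes GNU or BSD grep.

```shell
#!/bin/sh
# Hypothetical helper: print the most recent log lines that mention an
# exception or error, each with two lines of trailing context.
# $1 = path to the log file (e.g. ../logs/hadoop.log when run from runtime/local/bin).
show_job_errors() {
    log_file="$1"
    if [ ! -r "$log_file" ]; then
        echo "cannot read $log_file" >&2
        return 2
    fi
    grep -n -A 2 -E 'Exception|ERROR' "$log_file" | tail -n 40
}
```

Usage: show_job_errors ../logs/hadoop.log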
References:
Reference material: http://nlp.solutions.asia/?p=362
https://issues.apache.org/jira/browse/NUTCH-1473