Crawling data with Nutch 2.2.1 + MySQL

Environment: Linux CentOS 6.5, Nutch 2.2.1 source package, MySQL 5.5, Elasticsearch 1.1.1, JDK 1.7

1. Download the source package from http://mirror.bjtu.edu.cn/apache/nutch/2.2.1/ and extract it.

2. Switch the data store to MySQL

Edit ivy/ivy.xml under the Nutch root directory. The MySQL storage dependencies are commented out by default, so uncomment them.

3. Set the database URL, user name, and password in conf/gora.properties under the Nutch root directory, and comment out the original HSQLDB settings:

#gora.sqlstore.jdbc.driver=org.hsqldb.jdbc.JDBCDriver
#gora.sqlstore.jdbc.url=jdbc:hsqldb:hsql://localhost/nutchtest
#gora.sqlstore.jdbc.user=sa
#gora.sqlstore.jdbc.password=

###############################
# MySQL properties #
###############################
gora.sqlstore.jdbc.driver=com.mysql.jdbc.Driver
gora.sqlstore.jdbc.url=jdbc:mysql://ip:3306/nutch?useUnicode=true&characterEncoding=utf8&autoReconnect=true&zeroDateTimeBehavior=convertToNull
gora.sqlstore.jdbc.user=user
gora.sqlstore.jdbc.password=pwd
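
A stray space inside the JDBC URL (for example right after the `?`) is easy to miss in this file. The following is a minimal Python sketch, not part of Nutch or Gora, that parses the key=value lines and flags obvious URL problems. The helper names `parse_properties` and `check_jdbc_url` are hypothetical, and the checks are only the ones shown here.

```python
from urllib.parse import urlsplit, parse_qs

def parse_properties(text):
    """Parse simple key=value lines, skipping blank lines and '#' comments."""
    props = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")  # split at the first '=' only
        props[key.strip()] = value.strip()
    return props

def check_jdbc_url(url):
    """Return a list of problems found in a jdbc:mysql:// URL (empty if none)."""
    problems = []
    if any(ch.isspace() for ch in url):
        problems.append("URL contains whitespace")
    if not url.startswith("jdbc:mysql://"):
        problems.append("URL does not start with jdbc:mysql://")
        return problems
    parts = urlsplit(url[len("jdbc:"):])  # parse the part after 'jdbc:'
    if not parts.path.strip("/"):
        problems.append("no database name in URL")
    return problems

# Sample content mirroring the settings in this post (placeholders kept as-is).
SAMPLE = """\
#gora.sqlstore.jdbc.driver=org.hsqldb.jdbc.JDBCDriver
gora.sqlstore.jdbc.driver=com.mysql.jdbc.Driver
gora.sqlstore.jdbc.url=jdbc:mysql://ip:3306/nutch?useUnicode=true&characterEncoding=utf8&autoReconnect=true&zeroDateTimeBehavior=convertToNull
gora.sqlstore.jdbc.user=user
gora.sqlstore.jdbc.password=pwd
"""

props = parse_properties(SAMPLE)
url = props["gora.sqlstore.jdbc.url"]
print("problems:", check_jdbc_url(url))
# List the connection options carried in the query string.
print("options:", sorted(parse_qs(urlsplit(url[len("jdbc:"):]).query)))
```

Running the same check on a URL written as `jdbc:mysql://ip:3306/nutch? useUnicode=true` reports the whitespace problem.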

4. Edit conf/nutch-site.xml

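<property>
  <name>http.agent.name</name>
  <value>My Spider</value>
</property>

<property>
  <name>http.accept.language</name>
  <value>ja-jp,zh-cn,en-us,en-gb,en;q=0.7,*;q=0.3</value>
</property>

<property>
  <name>parser.character.encoding.default</name>
  <value>utf-8</value>
  <description>The character encoding to fall back to when no other information is available</description>
</property>

<property>
  <name>storage.data.store.class</name>
  <value>org.apache.gora.sql.store.SqlStore</value>
</property>

<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|urlnormalizer-(pass|regex|basic)|scoring-opic</value>
</property>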
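
The plugin.includes value is a regular expression over plugin ids, so it can be hard to see at a glance which plugins it enables. Here is a small Python sketch that expands it against a list of ids. It assumes the expression must match the whole plugin id; the candidate list is illustrative only, and `enabled_plugins` is a hypothetical helper, not a Nutch API.

```python
import re

# The plugin.includes value used in this post.
PLUGIN_INCLUDES = (
    "protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)"
    "|urlnormalizer-(pass|regex|basic)|scoring-opic"
)

def enabled_plugins(pattern, plugin_ids):
    """Return the ids that the pattern matches in full (assumed semantics)."""
    regex = re.compile(pattern)
    return [pid for pid in plugin_ids if regex.fullmatch(pid)]

# Illustrative candidate ids; a real installation has many more.
candidates = [
    "protocol-http",
    "parse-html",
    "parse-tika",
    "parse-js",
    "index-basic",
    "index-more",
    "urlnormalizer-regex",
    "scoring-opic",
]

for pid in enabled_plugins(PLUGIN_INCLUDES, candidates):
    print(pid)
```

With this value, `parse-js` and `index-more` are filtered out, while the other six candidates pass.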

5. Build the source with ant

Run ant from the Nutch directory:

job:
[jar] Building jar: /home/hadoop/nutch221/build/apache-nutch-2.2.1.job

runtime:
[mkdir] Created dir: /home/hadoop/nutch221/runtime
[mkdir] Created dir: /home/hadoop/nutch221/runtime/local
[mkdir] Created dir: /home/hadoop/nutch221/runtime/deploy
[copy] Copying 1 file to /home/hadoop/nutch221/runtime/deploy
[copy] Copying 2 files to /home/hadoop/nutch221/runtime/deploy/bin
[copy] Copying 1 file to /home/hadoop/nutch221/runtime/local/lib
[copy] Copying 1 file to /home/hadoop/nutch221/runtime/local/lib/native
[copy] Copying 26 files to /home/hadoop/nutch221/runtime/local/conf
[copy] Copying 2 files to /home/hadoop/nutch221/runtime/local/bin
[copy] Copying 100 files to /home/hadoop/nutch221/runtime/local/lib
[copy] Copying 106 files to /home/hadoop/nutch221/runtime/local/plugins
[copy] Copied 2 empty directories to 2 empty directories under /home/hadoop/nutch221/runtime/local/test

BUILD SUCCESSFUL
Total time: 41 seconds

The build succeeded.

6. Create the database

CREATE DATABASE nutch DEFAULT CHARACTER SET utf8 DEFAULT COLLATE utf8_general_ci;

CREATE TABLE `webpage` (
`id` varchar(767) CHARACTER SET latin1 NOT NULL,
`headers` blob,
`text` mediumtext DEFAULT NULL,
`status` int(11) DEFAULT NULL,
`markers` blob,
`parseStatus` blob,
`modifiedTime` bigint(20) DEFAULT NULL,
`score` float DEFAULT NULL,
`typ` varchar(32) CHARACTER SET latin1 DEFAULT NULL,
`baseUrl` varchar(512) CHARACTER SET latin1 DEFAULT NULL,
`content` mediumblob,
`title` varchar(2048) DEFAULT NULL,
`reprUrl` varchar(512) CHARACTER SET latin1 DEFAULT NULL,
`fetchInterval` int(11) DEFAULT NULL,
`prevFetchTime` bigint(20) DEFAULT NULL,
`inlinks` mediumblob,
`prevSignature` blob,
`outlinks` mediumblob,
`fetchTime` bigint(20) DEFAULT NULL,
`retriesSinceFetch` int(11) DEFAULT NULL,
`protocolStatus` blob,
`signature` blob,
`metadata` blob,
PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
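
The `id` primary key is declared as latin1 even though the table default is utf8. A likely reason is index key size: the sketch below assumes the 767-byte InnoDB index key limit that applies to MySQL 5.5 default row formats and a maximum of 3 bytes per character for MySQL's utf8. Both constants are assumptions to check against your server, and the function names are hypothetical.

```python
# Assumed InnoDB index key limit in bytes (MySQL 5.5 default row formats).
INNODB_MAX_KEY_BYTES = 767

# Assumed worst-case bytes per character for each MySQL charset.
MAX_BYTES_PER_CHAR = {"latin1": 1, "utf8": 3}

def key_bytes(length, charset):
    """Worst-case byte size of a varchar(length) index key in the charset."""
    return length * MAX_BYTES_PER_CHAR[charset]

def fits_in_index(length, charset):
    """True if a varchar(length) key stays within the assumed limit."""
    return key_bytes(length, charset) <= INNODB_MAX_KEY_BYTES

for charset in ("latin1", "utf8"):
    print(charset, key_bytes(767, charset), fits_in_index(767, charset))

# Longest utf8 varchar that would still fit under the assumed limit.
print("max utf8 length:", INNODB_MAX_KEY_BYTES // MAX_BYTES_PER_CHAR["utf8"])
```

Under these assumptions a varchar(767) key fits only in a single-byte charset, which matches the latin1 declaration on `id`.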

7. Run the crawl

After the crawl finishes, the fetched content can be viewed in MySQL.

8. Run the indexing step

# Official fix (from the nutch-user mailing list):

#http://mail-archives.apache.org/mod_mbox/nutch-user/201307.mbox/%3CCAErFeLSwoZ2UhxMA1iYi7H-L52Ojo-j9KoWT7xDittBzvB0F0A@mail.gmail.com%3E

######################

20141103

The fix: rebuilding Nutch resolves the problem.

A new problem then appeared:

./nutch crawl ../urls -depth 3

InjectorJob: Using class org.apache.gora.sql.store.SqlStore as the Gora storage class.

Exception in thread "main" java.lang.RuntimeException: job failed: name=inject ../urls, jobid=job_local713211278_0001

at org.apache.nutch.util.NutchJob.waitForCompletion(NutchJob.java:54)

at org.apache.nutch.crawl.InjectorJob.run(InjectorJob.java:233)

at org.apache.nutch.crawl.Crawler.runTool(Crawler.java:68)

./nutch crawl ../urls -depth 3 -topN 5

InjectorJob: Using class org.apache.gora.sql.store.SqlStore as the Gora storage class.

Exception in thread "main" java.lang.RuntimeException: job failed: name=inject ../urls, jobid=job_local1302478362_0001

at org.apache.nutch.util.NutchJob.waitForCompletion(NutchJob.java:54)

at org.apache.nutch.crawl.InjectorJob.run(InjectorJob.java:233)
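
The `../urls` argument in the crawl commands above is a directory of seed files, each a plain text file with one URL per line. As a minimal sketch, the Python helper below writes such a file. The file name seed.txt, the helper `write_seed_file`, and the example URL are assumptions for illustration, and the demo writes into a temporary directory rather than a real Nutch tree.

```python
import tempfile
from pathlib import Path

def write_seed_file(url_dir, urls, filename="seed.txt"):
    """Write one URL per line into url_dir/filename and return the path."""
    url_dir = Path(url_dir)
    url_dir.mkdir(parents=True, exist_ok=True)
    # Drop empty entries and surrounding whitespace.
    cleaned = [u.strip() for u in urls if u.strip()]
    seed = url_dir / filename
    seed.write_text("\n".join(cleaned) + "\n", encoding="utf-8")
    return seed

# Demo in a temporary directory; point url_dir at your own urls folder instead.
with tempfile.TemporaryDirectory() as tmp:
    seed = write_seed_file(Path(tmp) / "urls", ["http://nutch.apache.org/", "  "])
    print(seed.name)
    print(seed.read_text(encoding="utf-8"), end="")
```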

References:

Reference material: http://nlp.solutions.asia/?p=362

https://issues.apache.org/jira/browse/NUTCH-1473
