[wordpress搬家]nutch的二三事 — 2.2.1版本抓取到mysql中

[2013.12.25]


有一篇对应的博文博文,不过是2.1版本的,在最新的2.2.1版本中有很多问题,所以强烈建议大家一定要完全把这篇文章看完后再着手操作,不要跟着我一起走弯路。

流水账一样的配置过程。

mysql配置:


CREATE DATABASE nutch DEFAULT CHARACTER SET utf8mb4 DEFAULT COLLATE utf8mb4_unicode_ci;

CREATE TABLE `webpage` (
`id` varchar(767) NOT NULL,
`headers` blob,
`text` mediumtext DEFAULT NULL,
`status` int(11) DEFAULT NULL,
`markers` blob,
`parseStatus` blob,
`modifiedTime` bigint(20) DEFAULT NULL,
`score` float DEFAULT NULL,
`typ` varchar(32) CHARACTER SET latin1 DEFAULT NULL,
`baseUrl` varchar(767) DEFAULT NULL,
`content` longblob,
`title` varchar(2048) DEFAULT NULL,
`reprUrl` varchar(767) DEFAULT NULL,
`fetchInterval` int(11) DEFAULT NULL,
`prevFetchTime` bigint(20) DEFAULT NULL,
`inlinks` mediumblob,
`prevSignature` blob,
`outlinks` mediumblob,
`fetchTime` bigint(20) DEFAULT NULL,
`retriesSinceFetch` int(11) DEFAULT NULL,
`protocolStatus` blob,
`signature` blob,
`metadata` blob,
PRIMARY KEY (`id`)
) ENGINE=InnoDB
ROW_FORMAT=COMPRESSED
DEFAULT CHARSET=utf8mb4;

ivy/ivy.xml中需要uncomment这两行,让gora支持mysql

<dependency org="org.apache.gora" name="gora-sql" rev="0.1.1-incubating" conf="*->default" />
<dependency org="mysql" name="mysql-connector-java" rev="5.1.18" conf="*->default"/>

conf/gora.properties中需要写好数据库信息

gora.sqlstore.jdbc.driver=com.mysql.jdbc.Driver
gora.sqlstore.jdbc.url=jdbc:mysql://localhost:3306/nutch?createDatabaseIfNotExist=true
gora.sqlstore.jdbc.user=xxxxx
gora.sqlstore.jdbc.password=xxxxx

conf/gora-sql-mapping.xml中替换两个primarykey对应的length,因为ID变为了utf8,所以数据变长了。

<primarykey column="id" length="767"/>

另外就是关于抓取的,配置conf/nutch-site.xml,加入爬虫信息:

<property>
<name>http.agent.name</name>
<value>Ade's spider</value>
</property>

<property>
<name>http.accept.language</name>
<value>ja-jp, en-us,en-gb,en;q=0.7,*;q=0.3</value>
<description>Value of the “Accept-Language” request header field.
This allows selecting non-English language as default one to retrieve.
It is a useful setting for search engines build for certain national group.
</description>
</property>

<property>
<name>parser.character.encoding.default</name>
<value>utf-8</value>
<description>The character encoding to fall back to when no other information
is available</description>
</property>

<property>
<name>storage.data.store.class</name>
<value>org.apache.gora.sql.store.SqlStore</value>
<description>The Gora DataStore class for storing and retrieving data.
Currently the following stores are available: ….
</description>
</property>

由于还需要ivy下载一个sql connector与gora-sql,所以再ant编译一遍。

下面就可以开始抓取了:

cd ./runtime/local
mkdir -p urls
echo 'http://www.promenade.me' > urls/seed.txt
bin/nutch crawl urls -depth 3 -topN 5

有可能会遇到问题:

[root@AY131218101252507ad0Z local]# bin/nutch crawl urls -depth 3 -topN 5
InjectorJob: Using class org.apache.gora.sql.store.SqlStore as the Gora storage class.
InjectorJob: total number of urls rejected by filters: 0
InjectorJob: total number of urls injected after normalization and filtering: 0
Exception in thread "main" java.lang.RuntimeException: job failed: name=generate: null, jobid=job_local177967844_0002
	at org.apache.nutch.util.NutchJob.waitForCompletion(NutchJob.java:54)
	at org.apache.nutch.crawl.GeneratorJob.run(GeneratorJob.java:199)
	at org.apache.nutch.crawl.Crawler.runTool(Crawler.java:68)
	at org.apache.nutch.crawl.Crawler.run(Crawler.java:152)
	at org.apache.nutch.crawl.Crawler.run(Crawler.java:250)
	at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
	at org.apache.nutch.crawl.Crawler.main(Crawler.java:257)

查看logs/hadoop.log会说是一个Utf8类传入了空值。网上有一篇 Nutch2.0配置安装异常集锦 ,里面有对应的解释。
找到
nutch/src/java/org/apache/nutch/crawl/GeneratorReducer.java,然后看其100行左右:

batchId = new Utf8(conf.get(GeneratorJob.BATCH_ID));

//改为

int randomSeed = Math.abs(new Random().nextInt());
String batchIdStr = (System.currentTimeMillis() / 1000) + "-" + randomSeed;
batchId = new Utf8( batchIdStr );

//别忘了在最上面加上
import java.util.Random;

之后需要重新编译一遍,然后再去抓取,又出现异常,查看hadoop.log:

java.lang.Exception: java.lang.NoSuchMethodError: org.apache.gora.persistency.Persistent.getSchema()Lorg/apache/avro/Schema;
	at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:354)
Caused by: java.lang.NoSuchMethodError: org.apache.gora.persistency.Persistent.getSchema()Lorg/apache/avro/Schema;
	at org.apache.gora.sql.store.SqlStore.put(SqlStore.java:591)
	at org.apache.gora.mapreduce.GoraRecordWriter.write(GoraRecordWriter.java:65)
	at org.apache.hadoop.mapred.MapTask$NewDirectOutputCollector.write(MapTask.java:638)
	at org.apache.hadoop.mapreduce.TaskInputOutputContext.write(TaskInputOutputContext.java:80)
	at org.apache.nutch.crawl.InjectorJob$UrlMapper.map(InjectorJob.java:191)
	at org.apache.nutch.crawl.InjectorJob$UrlMapper.map(InjectorJob.java:88)
	at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:145)
	at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
	at org.apache.hadoop.mapred.MapTask.run(MapTask.java:364)
	at org.apache.hadoop.mapred.LocalJobRunner$Job$MapTaskRunnable.run(LocalJobRunner.java:223)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
	at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
	at java.util.concurrent.FutureTask.run(FutureTask.java:166)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1146)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
	at java.lang.Thread.run(Thread.java:701)

突然想到在ivy/ivy.xml中有这样写道:

Uncomment this to use SQL as Gora backend. It should be noted that the 
gora-sql 0.1.1-incubating artifact is NOT compatable with gora-core 0.3. Users should 
downgrade to gora-core 0.2.1 in order to use SQL as a backend.

好吧,就在这个提示上面一行,修改一下gora-core的版本为0.2.1。 再编译,再重来… 不出所料,又有问题,这回的错误是:

Unknown column 'batchId' in 'field list'

麻利儿的检查一下数据库哪里有问题,这个batchId就应该是刚才utf8错误的那个batchId,在mysql表中加一个字段呗。

CREATE TABLE `webpage` (
`id` varchar(767) NOT NULL,
`headers` blob,
`text` mediumtext DEFAULT NULL,
`status` int(11) DEFAULT NULL,
`markers` blob,
`parseStatus` blob,
`modifiedTime` bigint(20) DEFAULT NULL,
`score` float DEFAULT NULL,
`typ` varchar(32) CHARACTER SET latin1 DEFAULT NULL,
`baseUrl` varchar(767) DEFAULT NULL,
`content` longblob,
`title` varchar(2048) DEFAULT NULL,
`reprUrl` varchar(767) DEFAULT NULL,
`fetchInterval` int(11) DEFAULT NULL,
`prevFetchTime` bigint(20) DEFAULT NULL,
`inlinks` mediumblob,
`prevSignature` blob,
`outlinks` mediumblob,
`fetchTime` bigint(20) DEFAULT NULL,
`retriesSinceFetch` int(11) DEFAULT NULL,
`protocolStatus` blob,
`signature` blob,
`metadata` blob,
`batchId` varchar(767) DEFAULT NULL,
PRIMARY KEY (`id`)
) ENGINE=InnoDB
ROW_FORMAT=COMPRESSED
DEFAULT CHARSET=utf8mb4;

好吧,再运行,居然..居然开始抓取了…

mysql> select count(*) from webpage;
+----------+
| count(*) |
+----------+
|       65 |
+----------+
1 row in set (0.00 sec)


  • 0
    点赞
  • 0
    收藏
    觉得还不错? 一键收藏
  • 0
    评论

“相关推荐”对你有帮助么?

  • 非常没帮助
  • 没帮助
  • 一般
  • 有帮助
  • 非常有帮助
提交
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值