Installing Nutch 2.2 with MySQL

s20082043

Article link: https://blog.csdn.net/s20082043/article/details/43148667

The official site describes it as follows:

Apache Nutch is a highly extensible and scalable open source web crawler software project.

In short, Nutch is a highly extensible and scalable open-source web crawler project.

The difference between Nutch 1.x and Nutch 2.x:

storage is abstracted away from any specific underlying data store by using Apache Gora for handling object to persistent mappings. This means we can implement an extremely flexible model/stack for storing everything (fetch time, status, content, parsed text, outlinks, inlinks, etc.) into a number of NoSQL storage solutions.

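In other words, storage goes through Apache Gora's object-to-persistence mapping, so Nutch's storage layer is decoupled from any particular underlying data store.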

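First, check the Linux environment, as follows:

Nutch 2.2.1 + MySQL 5.6 is then deployed as follows:

1 Install MySQL 5.6 (omitted)

After installing MySQL, modify my.cnf as follows:

2 Log in to MySQL and create the table in the nutch database (the database named in the JDBC URL in step 3.5)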

CREATE TABLE `webpage` (
  `id` varchar(767) NOT NULL,
  `headers` blob,
  `text` longtext DEFAULT NULL,
  `status` int(11) DEFAULT NULL,
  `markers` blob,
  `parseStatus` blob,
  `modifiedTime` bigint(20) DEFAULT NULL,
  `prevModifiedTime` bigint(20) DEFAULT NULL,
  `score` float DEFAULT NULL,
  `typ` varchar(32) CHARACTER SET latin1 DEFAULT NULL,
  `batchId` varchar(32) CHARACTER SET latin1 DEFAULT NULL,
  `baseUrl` varchar(767) DEFAULT NULL,
  `content` longblob,
  `title` varchar(2048) DEFAULT NULL,
  `reprUrl` varchar(767) DEFAULT NULL,
  `fetchInterval` int(11) DEFAULT NULL,
  `prevFetchTime` bigint(20) DEFAULT NULL,
  `inlinks` mediumblob,
  `prevSignature` blob,
  `outlinks` mediumblob,
  `fetchTime` bigint(20) DEFAULT NULL,
  `retriesSinceFetch` int(11) DEFAULT NULL,
  `protocolStatus` blob,
  `signature` blob,
  `metadata` blob,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;

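3 Install Nutch 2.2

3.1 Download the Nutch 2.2 source

3.2 Download Ant

3.3 Enter the Nutch 2.2 directory, as shown (note: the runtime directory does not exist before compiling):

3.4 Run the ant command to compile. Afterwards a new runtime directory appears, containing two subdirectories, deploy and local, for deployed mode and local mode respectively.

The directory structure of deploy and local is as follows:

3.5 Enter the local directory for a local deployment

Enter the conf directory:

In gora.properties, set the MySQL connection as follows: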

gora.sqlstore.jdbc.driver=com.mysql.jdbc.Driver
gora.sqlstore.jdbc.url=jdbc:mysql://127.0.0.1:3306/nutch
gora.sqlstore.jdbc.user=root
gora.sqlstore.jdbc.password=root
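
The JDBC URL above also decides where the webpage table has to live: host 127.0.0.1, port 3306, database nutch. Below is a minimal shell sketch for reading that back out of the file. It assumes plain key=value lines and a URL shaped like jdbc:mysql://host:port/db. The helper names get_prop and split_jdbc_url are made up for illustration and are not part of Nutch or Gora.

```shell
#!/bin/sh
# Sketch: pull the JDBC URL out of gora.properties and split it into parts.
# Assumes simple "key=value" lines and a URL like jdbc:mysql://host:port/db.

# Print the value of key $2 from properties file $1 (first match only).
get_prop() {
    key=$(printf '%s' "$2" | sed 's/\./\\./g')   # escape dots for the regex
    sed -n "s/^${key}=//p" "$1" | head -n 1
}

# Print "host port database" for the jdbc:mysql URL given as $1.
split_jdbc_url() {
    rest=${1#jdbc:mysql://}      # e.g. 127.0.0.1:3306/nutch
    hostport=${rest%%/*}         # e.g. 127.0.0.1:3306
    db=${rest#*/}                # e.g. nutch
    db=${db%%\?*}                # drop any ?param=... suffix
    printf '%s %s %s\n' "${hostport%%:*}" "${hostport##*:}" "$db"
}
```

Run from runtime/local, `split_jdbc_url "$(get_prop conf/gora.properties gora.sqlstore.jdbc.url)"` prints `127.0.0.1 3306 nutch` for the values above, which is a quick way to confirm which database needs the webpage table.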

In nutch-site.xml, add the following properties inside <configuration></configuration>:

<property>
<name>http.agent.name</name>
<value>nutch01</value>
<description>HTTP 'User-Agent' request header. MUST NOT be empty -
please set this to a single word uniquely related to your organization.
NOTE: You should also check other related properties:
http.robots.agents
http.agent.description
http.agent.url
http.agent.email
http.agent.version
and set their values appropriately.
</description>
</property>
<property>
<name>http.accept.language</name>
<value>en-us,en-gb,en,zh-cn,zh-tw;q=0.7,*;q=0.3</value>
<description>Value of the "Accept-Language" request header field.
This allows selecting non-English language as default one to retrieve.
It is a useful setting for search engines build for certain national group.
</description>
</property>
<property>
<name>parser.character.encoding.default</name>
<value>utf-8</value>
<description>The character encoding to fall back to when no other information
is available</description>
</property>
<property>
<name>storage.data.store.class</name>
<value>org.apache.gora.sql.store.SqlStore</value>
<description>The Gora DataStore class for storing and retrieving data.
Currently the following stores are available:
org.apache.gora.sql.store.SqlStore 
Default store. A DataStore implementation for RDBMS with a SQL interface.
SqlStore uses JDBC drivers to communicate with the DB. As explained in
ivy.xml, currently >= gora-core 0.3 is not backwards compatable with
SqlStore.
org.apache.gora.cassandra.store.CassandraStore
Gora class for storing data in Apache Cassandra.
org.apache.gora.hbase.store.HBaseStore 
Gora class for storing data in Apache HBase.
org.apache.gora.accumulo.store.AccumuloStore
Gora class for storing data in Apache Accumulo.
org.apache.gora.avro.store.AvroStore
Gora class for storing data in Apache Avro.
org.apache.gora.avro.store.DataFileAvroStore
Gora class for storing data in Apache Avro. DataFileAvroStore is
a file based store which uses Avro's DataFile{Writer,Reader}'s as a backend.
This datastore supports mapreduce.
org.apache.gora.memory.store.MemStore
Gora class for storing data in a Memory based implementation for tests.
</description>
</property>

<property>
<name>generate.batch.id</name>
<value>cdv</value>
</property>

3.6 Return to the Nutch 2.2 root directory and enter the ivy directory

Edit ivy.xml:

1) Comment out: <dependency org="org.apache.gora" name="gora-hbase" rev="0.3" conf="*->default" />

2) Uncomment: <dependency org="org.apache.gora" name="gora-sql" rev="0.1.1-incubating" conf="*->default" />
<dependency org="mysql" name="mysql-connector-java" rev="5.1.18" conf="*->default"/>

3.7 Return to the root directory and run ant to recompile Nutch 2.2

3.8 Enter runtime/local and run: 1) mkdir urls

2) Create first.txt inside urls containing http://www.tianya.cn
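
Step 3.8 can also be scripted. This is a small sketch and not the only way to do it; make_seed_file is a hypothetical helper name, and the directory and URL are the ones used in this post.

```shell
#!/bin/sh
# Sketch: create a seed directory holding first.txt with one URL per line.
# $1 = seed directory (e.g. urls), $2 = seed URL.
make_seed_file() {
    seed_dir=$1
    url=$2
    # printf writes exactly the URL plus a newline, with no editor artifacts.
    mkdir -p "$seed_dir" && printf '%s\n' "$url" > "$seed_dir/first.txt"
}
```

From runtime/local, `make_seed_file urls http://www.tianya.cn` leaves urls/first.txt containing that single line.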

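3.9 Return to the local directory and run: bin/nutch

This lists all available commands, including:

inject: imports the URLs to be crawled into the database

generate: selects a batch of URLs that are due for fetching from the database and puts them in the fetch queue

fetch: fetches the URLs in the fetch queue and marks them as fetched

parse: parses the fetched pages

updatedb: writes newly discovered links back to the database

solrindex: sends the parsed content to a Solr full-text index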

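3.10 The inject command in detail:

Run the following command: bin/nutch inject urls

It fails with the following error:

The cause:

When the webpage table was created, the 'text' column was declared as longtext. Fix it as follows:

1) Change the column type in the database to BLOB

2) Modify conf/gora-sql-mapping.xml:

3) Return to the Nutch 2.2 root directory and recompile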
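
For step 1), the column change can be written as a single ALTER statement. The sketch below only prints the statement so it can be reviewed first. The table and column names come from the CREATE TABLE statement earlier; text_column_fix_sql is an illustrative name, and piping into the mysql client with the root user and the nutch database is an assumption based on gora.properties.

```shell
#!/bin/sh
# Sketch: emit the ALTER statement that turns the `text` column into a BLOB.
# $1 = table name (defaults to webpage, the table created in step 2).
text_column_fix_sql() {
    table=${1:-webpage}
    printf 'ALTER TABLE `%s` MODIFY `text` BLOB;\n' "$table"
}

text_column_fix_sql
# To apply it (assumes the mysql client is installed):
#   text_column_fix_sql | mysql -u root -p nutch
```

Doing this on an empty table is the safe case; a table that already holds long text values should be checked before shrinking the column type.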

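Run bin/nutch inject urls again; a new record now appears in the MySQL database.

What does -crawlId do? Running bin/nutch inject urls -crawlId nutch01 fails with the following error:

Cause of the error:

Searching online turned up no specific cause, so in the end I had to read the source code:

1) In InjectJob.java I found: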

if ("-crawlId".equals(args[i])) {
    getConf().set(Nutch.CRAWL_ID_KEY, args[i+1]);
    i++;
}

In other words, when -crawlId is given, its value is stored in the job's Configuration object, where it acts as a global setting.

2) In StorageUtils.java:

String crawlId = conf.get(Nutch.CRAWL_ID_KEY, "");

if (!crawlId.isEmpty()) {
    conf.set("schema.prefix", crawlId + "_");
} else {
    conf.set("schema.prefix", "");
}

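Now it finally makes sense: crawlId is used as a table-name prefix. So whenever -crawlId is passed, the webpage table in the database has to be renamed to match. For example, bin/nutch inject urls -crawlId nutch01 requires the table to be named nutch01_webpage.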
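
The prefix rule is simple enough to mirror in a few lines of shell. This is only a sketch of the behaviour read from StorageUtils above, not Nutch code. Both function names are made up for illustration, and webpage is the base table created in step 2.

```shell
#!/bin/sh
# Sketch of the prefix rule: "<crawlId>_" + base table name; empty id = no prefix.
table_for_crawl_id() {
    crawl_id=$1
    if [ -n "$crawl_id" ]; then
        printf '%s_webpage\n' "$crawl_id"
    else
        printf 'webpage\n'
    fi
}

# Emit the rename needed before injecting with a non-empty -crawlId $1.
rename_sql_for_crawl_id() {
    printf 'RENAME TABLE `webpage` TO `%s`;\n' "$(table_for_crawl_id "$1")"
}

table_for_crawl_id nutch01        # prints: nutch01_webpage
rename_sql_for_crawl_id nutch01   # prints: RENAME TABLE `webpage` TO `nutch01_webpage`;
```

The printed statement can be pasted into the mysql prompt. Keeping one table per crawl id also means several crawls can share the same database without mixing rows.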

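At this point I expected bin/nutch inject urls -crawlId nutch01 to finally run without errors, but it failed again with the following error:

The message alone made no sense to me, so I checked logs/hadoop.log for the details:

A method was missing on the Gora side: Nutch, running against gora-0.3.jar, calls a method that is not there. Persistent.java in gora-0.3 indeed has no getSchema() method, whereas the gora-0.2.1 source, which I downloaded to compare, does have it. This matches the note in the storage.data.store.class description above that gora-core >= 0.3 is not backwards compatible with SqlStore. So, back in the Nutch 2.2 root directory,

open ivy/ivy.xml:

Change <dependency org="org.apache.gora" name="gora-core" rev="0.3" conf="*->default"/> to: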

<dependency org="org.apache.gora" name="gora-core" rev="0.2.1" conf="*->default"/>

Then recompile Nutch 2.2 and run bin/nutch inject urls -crawlId nutch01 again; this time it finally completes without errors.
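
After the rebuild it is worth checking which Gora jars actually ended up on the runtime classpath. The sketch below assumes the jars sit in runtime/local/lib and follow the usual artifact-version.jar naming (for example gora-core-0.2.1.jar); list_gora_jars is an illustrative helper name.

```shell
#!/bin/sh
# Sketch: list the Gora jars found in a lib directory, one file name per line.
# $1 = lib directory (defaults to runtime/local/lib, an assumed location).
list_gora_jars() {
    lib_dir=${1:-runtime/local/lib}
    for jar in "$lib_dir"/gora-*.jar; do
        # Skip the unexpanded pattern when nothing matches.
        if [ -e "$jar" ]; then
            basename "$jar"
        fi
    done
}
```

If a gora-core jar with version 0.3 still shows up next to 0.2.1, the old jar is probably left over from the earlier build and a clean rebuild may be needed.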

After that, run generate, fetch, parse and updatedb one step at a time, following the argument descriptions that bin/nutch prints for each. To be continued...
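
As a starting point for those remaining steps, one crawl round might look like the sketch below. The flags (-topN, -all, -crawlId) are assumptions based on the Nutch 2.2.x usage text and should be confirmed by running bin/nutch with each command name. NUTCH and crawl_round are illustrative names, and NUTCH can be overridden for a dry run.

```shell
#!/bin/sh
# Sketch of one crawl round after inject. Flags are assumed from the
# Nutch 2.2.x usage text; confirm each with `bin/nutch <command>`.
NUTCH=${NUTCH:-bin/nutch}   # hypothetical override, handy for dry runs

# $1 = crawl id (table prefix), $2 = max URLs to generate (default 10).
crawl_round() {
    crawl_id=$1
    top_n=${2:-10}
    # Each step runs only if the previous one succeeded.
    $NUTCH generate -topN "$top_n" -crawlId "$crawl_id" &&
    $NUTCH fetch -all -crawlId "$crawl_id" &&
    $NUTCH parse -all -crawlId "$crawl_id" &&
    $NUTCH updatedb -crawlId "$crawl_id"
}
```

Setting NUTCH='echo bin/nutch' before calling `crawl_round nutch01 10` prints the four commands instead of running them, which is a cheap way to review the sequence first.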