Nutch 2.X Tutorial

Introduction

This document describes how to configure Nutch 2.X to use HBase as the storage backend through Gora. It assumes a working knowledge of configuring Nutch 1.X, because configuration in 2.X is currently more complex. Take this into consideration before going any further; we strongly advise that you work through the Nutch 1.X tutorial first.

Obtaining Software and Configuration

Grab the latest distribution of Nutch 2.X from here. Do NOT build the source yet. From now on we will refer to the directory where the Nutch code resides as $NUTCH_HOME.

Download and configure HBase 0.98.8-hadoop2. You can get it here. N.B. Each version of Gora is tied to a particular version of HBase, so we suggest you use this version if possible. If you decide to use another version of HBase, do not be surprised if the stack does not work. You should also obtain the HBase documentation, but bear in mind that the current documentation may not match the version of HBase recommended here.

Specify the Gora backend in $NUTCH_HOME/conf/nutch-site.xml, along with all of the other configuration options suggested in the Nutch 1.X tutorial.

<property>
  <name>storage.data.store.class</name>
  <value>org.apache.gora.hbase.store.HBaseStore</value>
  <description>Default class for storing data</description>
</property>

Ensure the gora-hbase dependency is available in $NUTCH_HOME/ivy/ivy.xml.

In addition, add the missing hbase-common-0.98.8-hadoop2.jar transitive dependency. Its absence is a bug in gora-hbase 0.6.1, as described here; the bug is fixed in current Gora development.
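One way to confirm that both dependencies are active in ivy.xml is to search the file while ignoring anything inside XML comments. The following is only a sketch: the helper names strip_xml_comments and has_ivy_dep are hypothetical, and it assumes each dependency is declared with a name="..." attribute, as is usual for Ivy dependency elements.

```shell
# Hypothetical helper: print an XML file with <!-- ... --> comments removed,
# including comments that span several lines.
strip_xml_comments() {
  awk '
    {
      line = $0
      out = ""
      while (length(line) > 0) {
        if (in_comment) {
          i = index(line, "-->")
          if (i == 0) { line = "" }
          else { line = substr(line, i + 3); in_comment = 0 }
        } else {
          i = index(line, "<!--")
          if (i == 0) { out = out line; line = "" }
          else { out = out substr(line, 1, i - 1); line = substr(line, i + 4); in_comment = 1 }
        }
      }
      print out
    }' "$1"
}

# Hypothetical helper: succeed only if a dependency with the given name
# appears outside of any comment.
has_ivy_dep() {
  strip_xml_comments "$1" | grep -q "name=\"$2\""
}

# Usage (paths follow the $NUTCH_HOME layout used in this tutorial):
#   has_ivy_dep "$NUTCH_HOME/ivy/ivy.xml" gora-hbase   || echo "gora-hbase is not enabled"
#   has_ivy_dep "$NUTCH_HOME/ivy/ivy.xml" hbase-common || echo "hbase-common is missing"
```

Stripping comments first matters because a dependency that is present but commented out would otherwise look as if it were enabled.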

Ensure that HBaseStore is set as the default datastore in $NUTCH_HOME/conf/gora.properties. Other documentation for HBaseStore can be found here.

gora.datastore.default=org.apache.gora.hbase.store.HBaseStore
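If you prefer to apply this setting from the shell rather than by hand, a minimal sketch could look like the following. The function name set_default_datastore is hypothetical, and the sketch assumes it is acceptable to drop any existing (active or commented-out) gora.datastore.default line before appending the new one.

```shell
# Hypothetical helper: make gora.datastore.default point at the given class.
# Any existing line for that key, commented out or not, is removed first.
set_default_datastore() {
  props_file="$1"
  store_class="$2"
  tmp_file="$(mktemp)"
  # grep -v exits non-zero when nothing is left to print, hence "|| true".
  grep -v '^[#[:space:]]*gora\.datastore\.default=' "$props_file" > "$tmp_file" || true
  printf 'gora.datastore.default=%s\n' "$store_class" >> "$tmp_file"
  # Write back in place so the original file keeps its permissions.
  cat "$tmp_file" > "$props_file"
  rm -f "$tmp_file"
}

# Usage:
#   set_default_datastore "$NUTCH_HOME/conf/gora.properties" \
#     org.apache.gora.hbase.store.HBaseStore
```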

N.B. It's probably worth checking and setting all your usual configuration settings within $NUTCH_HOME/conf/nutch-site.xml etc. before progressing.

Compile Nutch with:

ant runtime

Make sure HBase is started and working properly as per the quick start tutorial.

Create a list of URLs as you would do within the Nutch 1.X tutorial.
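As a reminder of what that step looks like, here is a minimal sketch. The directory name urls, the file name seed.txt and the seed URL itself are illustrative choices, not requirements.

```shell
# Create a seed directory containing a plain-text file with one URL per line.
mkdir -p urls
printf '%s\n' 'http://nutch.apache.org/' > urls/seed.txt
```

The directory (here urls) is what you later pass to the inject command.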

Invoke Nutch

You should then be able to inject URLs into HBase. Go to $NUTCH_HOME/runtime/local/bin and run:

nutch inject /someseedDir

nutch readdb

What's Next

You may want to check out the documentation for the Nutch Web Application and then the Nutch REST API, as these give a comprehensive overview of ongoing work to make Nutch 2.X easier to use.

Extra/Important Notes

N.B. The crawl command in the bin/nutch script is deprecated. Use the individual commands instead, or use the bin/crawl script, which effectively chains the individual commands together.
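To illustrate what chaining the individual commands means, the sketch below runs one inject step and one crawl round. It is not the bin/crawl script itself: the names NUTCH_BIN, inject_seeds and crawl_round are hypothetical, and the flags shown (-topN, -all) are assumptions about the 2.X command line that you should check against the usage text printed by ./bin/nutch for your version.

```shell
# Path to the nutch launcher; override NUTCH_BIN to point elsewhere.
NUTCH_BIN="${NUTCH_BIN:-./bin/nutch}"

# Inject seed URLs once, before the first round.
inject_seeds() {
  "$NUTCH_BIN" inject "$1"
}

# One crawl round: generate a fetch list, fetch it, parse it, update the db.
# Each step runs only if the previous one succeeded.
crawl_round() {
  top_n="$1"
  "$NUTCH_BIN" generate -topN "$top_n" &&
  "$NUTCH_BIN" fetch -all &&
  "$NUTCH_BIN" parse -all &&
  "$NUTCH_BIN" updatedb -all
}

# Usage, from $NUTCH_HOME/runtime/local:
#   inject_seeds urls
#   crawl_round 50
```

Repeating crawl_round several times gives a deeper crawl, which is roughly what a chaining script automates.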

You should find more details in the log at $NUTCH_HOME/runtime/local/logs/hadoop.log.
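When something goes wrong, it is often quickest to look at only the most recent warnings and errors. The helper below is a small sketch; the name recent_problems is hypothetical, and it assumes log lines carry the level names ERROR or WARN.

```shell
# Hypothetical helper: show the last 20 ERROR/WARN lines of a log file.
recent_problems() {
  grep -E 'ERROR|WARN' "$1" | tail -n 20
}

# Usage:
#   recent_problems "$NUTCH_HOME/runtime/local/logs/hadoop.log"
```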

N.B. You may encounter the following exception: java.lang.NoClassDefFoundError: org/apache/hadoop/hbase/HBaseConfiguration. This happens because the HBase test jar is sometimes deployed in the lib dir. To resolve it, copy the lib over from your installed HBase dir into the build lib dir. (A fix for this issue is in progress.)
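The copy step can be sketched as follows. The function name copy_hbase_jars and the variable HBASE_HOME are hypothetical, and the choice of $NUTCH_HOME/runtime/local/lib as the target is an assumption about which lib dir is meant; adjust it to wherever your build places its jars.

```shell
# Hypothetical helper: copy every jar from an HBase lib dir into a Nutch lib dir.
copy_hbase_jars() {
  hbase_lib="$1"
  nutch_lib="$2"
  for jar in "$hbase_lib"/*.jar; do
    # If the glob matched nothing, the pattern is left as-is; skip it.
    [ -e "$jar" ] || continue
    cp "$jar" "$nutch_lib/"
  done
}

# Usage (HBASE_HOME is assumed to point at your HBase installation):
#   copy_hbase_jars "$HBASE_HOME/lib" "$NUTCH_HOME/runtime/local/lib"
```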

N.B. To use one of the other datastore implementations offered by Gora, e.g. Apache Cassandra or Accumulo, adjust the settings above before compiling the Nutch code.

N.B. As of Apache Gora release 0.3, the gora-sql 0.1.1-incubating artifact is deprecated. If you wish to use MySQL or HSQLDB as a Gora backend, the option is to downgrade to Nutch 2.1.

For more details of the command line interface options, please see here, or run ./bin/nutch, which prints usage to stdout. Finally, for a more detailed Nutch (1.X) tutorial, please see here.

