Nutch Installation Guide

Operating system: CentOS 6.4

HBase

HBase can run on the local file system, but that does not guarantee data durability. Production environments should use HDFS as the backing store.

Set the hostname and update the hosts file

Set the machine's hostname and add a matching entry to the hosts file so the name resolves.

Edit /etc/sysconfig/network:

NETWORKING=yes
HOSTNAME=nutch.bis.com.cn

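The change in /etc/sysconfig/network takes effect only after a reboot. To apply the hostname right away, run the following command and log in again: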

hostname nutch.bis.com.cn

Edit /etc/hosts:

[root@nutch apache-nutch-2.2.1]# cat /etc/hosts
127.0.0.1   localhost localhost.localdomain localhost4 localhost4.localdomain4
::1         localhost localhost.localdomain localhost6 localhost6.localdomain6
172.35.0.118 nutch.bis.com.cn
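
The hosts entry can also be added from a script. The following is a minimal sketch, not part of the original procedure: add_host_entry is a hypothetical helper name, and it appends the line only when the hostname is not already listed, so running it twice is harmless.

```shell
#!/bin/sh
# add_host_entry IP HOSTNAME [FILE]
# Hypothetical helper: appends "IP HOSTNAME" to FILE (default /etc/hosts)
# unless a non-comment line already lists HOSTNAME as one of its names.
add_host_entry() {
    ip=$1
    name=$2
    file=${3:-/etc/hosts}
    # Scan every name column (fields 2..NF) for an exact match.
    if awk -v n="$name" '
        /^[[:space:]]*#/ { next }
        { for (i = 2; i <= NF; i++) if ($i == n) found = 1 }
        END { exit !found }' "$file"; then
        return 0
    fi
    printf '%s %s\n' "$ip" "$name" >> "$file"
}

# Example (run as root):
#   add_host_entry 172.35.0.118 nutch.bis.com.cn
```

Passing a third argument points the helper at a copy of the file, which is a safe way to try it before touching /etc/hosts.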

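Download and install HBase

Nutch 2.x supports HBase 0.90.4; in theory any release on the 0.90.x branch works. Other versions fail because of incompatible jars.
Official site: hbase-0.90.4.tar.gz

After downloading, extract the archive: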

tar xvf hbase-0.90.4.tar.gz
cd hbase-0.90.4

Edit conf/hbase-site.xml and set hbase.rootdir and hbase.zookeeper.property.dataDir. If they are not set, HBase stores its data under /tmp, where it is lost after a reboot.

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>hbase.rootdir</name>
    <value>file:///opt/hbase</value>
  </property>
  <property>
    <name>hbase.zookeeper.property.dataDir</name>
    <value>/opt/zookeeper</value>
  </property>
</configuration>

Replace the contents of the value elements with the paths where the data should be stored.
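
If the paths change between machines, the file can be generated rather than edited by hand. The sketch below is one way to do that: write_hbase_site is a hypothetical helper, it assumes both directories are given as absolute paths, and it creates them if they are missing.

```shell
#!/bin/sh
# write_hbase_site CONF_FILE HBASE_DIR ZK_DIR
# Hypothetical helper: creates the two data directories and writes a
# standalone hbase-site.xml that points at them.
# HBASE_DIR must be absolute so that "file://" + HBASE_DIR is a valid URI.
write_hbase_site() {
    conf_file=$1
    hbase_dir=$2
    zk_dir=$3
    mkdir -p "$hbase_dir" "$zk_dir"
    cat > "$conf_file" <<EOF
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>hbase.rootdir</name>
    <value>file://$hbase_dir</value>
  </property>
  <property>
    <name>hbase.zookeeper.property.dataDir</name>
    <value>$zk_dir</value>
  </property>
</configuration>
EOF
}

# Example, run from the hbase-0.90.4 directory:
#   write_hbase_site conf/hbase-site.xml /opt/hbase /opt/zookeeper
```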

Start HBase

./bin/start-hbase.sh
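
Before moving on, it can help to confirm that HBase actually started. This is a rough sketch under two assumptions: the JDK's jps tool is on the PATH, and the standalone HBase process is listed under the name HMaster. has_hmaster is a hypothetical helper that reads a jps-style listing from standard input.

```shell
#!/bin/sh
# has_hmaster
# Hypothetical helper: reads "PID NAME" lines (the format jps prints) from
# stdin and succeeds only if one of them is named HMaster.
has_hmaster() {
    awk '$2 == "HMaster" { found = 1 } END { exit !found }'
}

# Example:
#   jps | has_hmaster && echo "HBase master is running"
```

If the check fails, the files under the HBase logs directory are the first place to look.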

Nutch

Download the latest Nutch release, 2.2.1.
Official site: apache-nutch-2.2.1-src.tar.gz

Extract Nutch:

tar xvf apache-nutch-2.2.1-src.tar.gz
cd apache-nutch-2.2.1

Edit conf/nutch-site.xml:

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
<property>
  <name>storage.data.store.class</name>
  <value>org.apache.gora.hbase.store.HBaseStore</value>
  <description>The Gora DataStore class for storing and retrieving data.
   Currently the following stores are available:
  org.apache.gora.sql.store.SqlStore
    Default store. A DataStore implementation for RDBMS with a SQL interface.
    SqlStore uses JDBC drivers to communicate with the DB. As explained in
    ivy.xml, currently >= gora-core 0.3 is not backwards compatible with
    SqlStore.
  org.apache.gora.cassandra.store.CassandraStore
    Gora class for storing data in Apache Cassandra.
  org.apache.gora.hbase.store.HBaseStore
    Gora class for storing data in Apache HBase.
  org.apache.gora.accumulo.store.AccumuloStore
    Gora class for storing data in Apache Accumulo.
  org.apache.gora.avro.store.AvroStore
    Gora class for storing data in Apache Avro.
  org.apache.gora.avro.store.DataFileAvroStore
    Gora class for storing data in Apache Avro. DataFileAvroStore is
    a file based store which uses Avro's DataFile{Writer,Reader}'s as a backend.
    This datastore supports mapreduce.
  org.apache.gora.memory.store.MemStore
    Gora class for storing data in a Memory based implementation for tests.
  </description>
</property>
<property>
  <name>http.content.limit</name>
  <value>6553600</value>
  <description>The length limit for downloaded content using the http
  protocol, in bytes. If this value is nonnegative (>=0), content longer
  than it will be truncated; otherwise, no truncation at all. Do not
  confuse this setting with the file.content.limit setting.
  </description>
</property>
<property>
  <name>http.agent.name</name>
  <value>nbot</value>
  <description>HTTP 'User-Agent' request header. MUST NOT be empty -
  please set this to a single word uniquely related to your organization.
  NOTE: You should also check other related properties:
        http.robots.agents
        http.agent.description
        http.agent.url
        http.agent.email
        http.agent.version
  and set their values appropriately.
  </description>
</property>
<property>
  <name>http.robots.agents</name>
  <value>nbot,*</value>
  <description>The agent strings we'll look for in robots.txt files,
  comma-separated, in decreasing order of precedence. You should
  put the value of http.agent.name as the first agent name, and keep the
  default * at the end of the list. E.g.: BlurflDev,Blurfl,*
  </description>
</property>
<property>
  <name>http.accept.language</name>
  <value>zh-cn,ja-jp,en-us,en-gb,en;q=0.7,*;q=0.3</value>
  <description>Value of the "Accept-Language" request header field.
  This allows selecting non-English language as default one to retrieve.
  It is a useful setting for search engines built for certain national group.
  </description>
</property>
<property>
  <name>parser.character.encoding.default</name>
  <value>utf-8</value>
  <description>The character encoding to fall back to when no other information
  is available</description>
</property>
</configuration>

Edit ivy/ivy.xml and uncomment the gora-hbase dependency shown below. Several lines share the same org, so make sure you pick the one whose name is "gora-hbase".

<dependency org="org.apache.gora" name="gora-hbase" rev="0.3" conf="*->default" />

Add the following line to conf/gora.properties to make HBase the default data store:

gora.datastore.default=org.apache.gora.hbase.store.HBaseStore
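
The same edit can be scripted so that repeated runs do not pile up duplicate lines. A minimal sketch, assuming a hypothetical helper named ensure_line that appends a line to a file only if that exact line is not already there:

```shell
#!/bin/sh
# ensure_line LINE FILE
# Hypothetical helper: appends LINE to FILE unless FILE already contains
# exactly that line. -x matches whole lines, -F treats LINE as a literal.
ensure_line() {
    line=$1
    file=$2
    grep -qxF -e "$line" "$file" 2>/dev/null || printf '%s\n' "$line" >> "$file"
}

# Example, run from the apache-nutch-2.2.1 directory:
#   ensure_line 'gora.datastore.default=org.apache.gora.hbase.store.HBaseStore' conf/gora.properties
```

The helper does not touch other lines, so a different gora.datastore.default setting already in the file would still need to be removed or commented out by hand.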

Build Nutch

Install ant:

yum install -y ant

Run the build:

ant runtime

The build downloads many jars, so it takes a while.

After a successful build, a runtime directory is created and Nutch is ready to use.

cd runtime/local/bin
./nutch inject /path/to/urls_folder
./nutch readdb -stats
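
The inject command expects a directory of plain-text files with one URL per line. The sketch below prepares such a directory; make_seed_dir, the seed.txt file name, and the example path and URL are all illustrative assumptions rather than anything the original specifies.

```shell
#!/bin/sh
# make_seed_dir DIR URL...
# Hypothetical helper: creates DIR and writes the given URLs, one per line,
# to DIR/seed.txt (the file name is an arbitrary choice).
make_seed_dir() {
    dir=$1
    shift
    mkdir -p "$dir"
    printf '%s\n' "$@" > "$dir/seed.txt"
}

# Example, run from runtime/local/bin:
#   make_seed_dir /tmp/urls http://nutch.apache.org/
#   ./nutch inject /tmp/urls
```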

The default log file is runtime/local/logs/hadoop.log.

Reposted from: https://my.oschina.net/junfrank/blog/286406
