Implementing an HBase Secondary Index with Solr + hbase-solr (hbase-indexer)

This article walks through setting up an HBase secondary-index environment on CentOS 6.7, using the hbase-solr (hbase-indexer) plugin together with Solr. It covers environment preparation, deploying hbase-indexer and HBase, starting Solr, and testing data indexing. It is aimed at readers with some Linux and Hadoop background.

Preface:

For a project I needed to set up an HBase secondary-index environment. The tutorials I found online were all full of pitfalls, so I decided to put together a reasonably complete one. This article deliberately skips details that experienced users will recognize at a glance; it is written for readers who are new to this particular area but have some Linux and Hadoop background, not for complete beginners.

Environment:

OS:CentOS6.7-x86_64

JDK:jdk1.7.0_109

hadoop-2.6.0+cdh5.4.1

hbase-solr-1.5+cdh5.4.1 (hbase-indexer-1.5-cdh5.4.1)

solr-4.10.3-cdh5.4.1

zookeeper-3.4.5-cdh5.4.1

hbase-1.0.0-cdh5.4.1

I. Basic Environment Preparation

1. A three-node Hadoop cluster, with server roles assigned as follows:

(Figure: server role assignment table)

Get the NameNode, DataNodes, ZooKeeper, JournalNodes, and ZKFC running first. The specifics are standard Hadoop deployment, not the focus of this article, so I won't dwell on them.

2. Download the required CDH-version software:

Download the tarballs from the link page mentioned at the top of this article. Note that the hbase-solr tarball contains the entire project, but we only need its distribution artifact: extract the hbase-solr-1.5+cdh5.4.1 tarball and find hbase-indexer-1.5-cdh5.4.1.tar.gz under hbase-solr-1.5-cdh5.4.1\hbase-indexer-dist\target. It will be used later.

II. Deploying hbase-indexer

Copy hbase-indexer-1.5-cdh5.4.1.tar.gz to node2 or node3.

Extract hbase-indexer-1.5-cdh5.4.1.tar.gz:

tar zxvf hbase-indexer-1.5-cdh5.4.1.tar.gz

Edit the hbase-indexer settings:

vim hbase-indexer-1.5-cdh5.4.1/conf/hbase-indexer-site.xml

Add the following properties inside the <configuration> element:

  <property>
    <name>hbaseindexer.zookeeper.connectstring</name>
    <value>node1:2181,node2:2181,node3:2181</value>
  </property>
  <property>
    <name>hbase.zookeeper.quorum</name>
    <value>node1,node2,node3</value>
  </property>
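The connect string above is just a comma-separated list of host:port pairs. As a quick illustration of the format (the class and method names here are only for this sketch, not part of hbase-indexer):

```java
import java.util.ArrayList;
import java.util.List;

public class ZkConnectString {
    // Splits "host1:port1,host2:port2,..." into its host:port entries.
    static List<String> hosts(String connectString) {
        List<String> result = new ArrayList<String>();
        for (String entry : connectString.split(",")) {
            result.add(entry.trim());
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> entries = hosts("node1:2181,node2:2181,node3:2181");
        System.out.println(entries.size());   // 3
        System.out.println(entries.get(0));   // node1:2181
    }
}
```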

Configure hbase-indexer-env.sh:

vim hbase-indexer-1.5-cdh5.4.1/conf/hbase-indexer-env.sh

Set JAVA_HOME:

# Set environment variables here.
# This script sets variables multiple times over the course of starting an hbase-indexer process,
# so try to keep things idempotent unless you want to take an even deeper look
# into the startup scripts (bin/hbase-indexer, etc.)

# The java implementation to use. Java 1.6 required.
export JAVA_HOME=/usr/java/jdk1.7.0/
# adjust to your environment

# Extra Java CLASSPATH elements. Optional.
# export HBASE_INDEXER_CLASSPATH=

# The maximum amount of heap to use, in MB. Default is 1000.
# export HBASE_INDEXER_HEAPSIZE=1000

# Extra Java runtime options.
# Below are what we set by default. May only work with SUN JVM.
# For more on why as well as other possible settings,
# see http://wiki.apache.org/hadoop/PerformanceTuning
export HBASE_INDEXER_OPTS="$HBASE_INDEXER_OPTS -XX:+UseConcMarkSweepGC"

Then use scp to copy the whole hbase-indexer-1.5-cdh5.4.1 directory to node3.

III. Deploying HBase

Extract the HBase tarball:

tar zxvf hbase-1.0.0-cdh5.4.1.tar.gz

Likewise, edit hbase-site.xml:

vim hbase-1.0.0-cdh5.4.1/conf/hbase-site.xml

Add the following properties inside the <configuration> element:

  <property>
    <name>hbase.rootdir</name>
    <value>hdfs://node1:9000/hbase</value>
    <description>The directory shared by RegionServers</description>
  </property>
  <property>
    <name>hbase.master</name>
    <value>node1:60000</value>
  </property>
  <property>
    <name>hbase.cluster.distributed</name>
    <value>true</value>
    <description>The mode the cluster will be in. Possible values are
      false: standalone and pseudo-distributed setups with managed ZooKeeper;
      true: fully-distributed with unmanaged ZooKeeper quorum (see hbase-env.sh)</description>
  </property>
  <property>
    <name>hbase.replication</name>
    <value>true</value>
    <description>SEP is basically replication, so enable it</description>
  </property>
  <property>
    <name>replication.source.ratio</name>
    <value>1.0</value>
    <description>Source ratio of 100% makes sure that each SEP consumer is actually used (otherwise, some can sit idle, especially with small clusters)</description>
  </property>
  <property>
    <name>replication.source.nb.capacity</name>
    <value>1000</value>
    <description>Maximum number of hlog entries to replicate in one go. If this is large, and a consumer takes a while to process the events, the HBase rpc call will time out.</description>
  </property>
  <property>
    <name>replication.replicationsource.implementation</name>
    <value>com.ngdata.sep.impl.SepReplicationSource</value>
    <description>A custom replication source that fixes a few things and adds some functionality (doesn't interfere with normal replication usage).</description>
  </property>
  <property>
    <name>hbase.zookeeper.quorum</name>
    <value>node1,node2,node3</value>
  </property>
  <property>
    <name>hbase.zookeeper.property.dataDir</name>
    <value>/home/HBasetest/zookeeperdata</value>
    <description>Property from ZooKeeper's config zoo.cfg. The directory where the snapshot is stored.</description>
  </property>

Similarly, edit hbase-env.sh:

vim hbase-1.0.0-cdh5.4.1/conf/hbase-env.sh

Set JAVA_HOME and HBASE_HOME:

# Set environment variables here.
# This script sets variables multiple times over the course of starting an hbase process,
# so try to keep things idempotent unless you want to take an even deeper look
# into the startup scripts (bin/hbase, etc.)

# The java implementation to use. Java 1.7+ required.
# export JAVA_HOME=/usr/java/jdk1.6.0/
export JAVA_HOME=/opt/jdk1.7.0_79
export HBASE_HOME=/home/HBasetest/hbase-1.0.0-cdh5.4.1
# adjust to your environment

# Extra Java CLASSPATH elements. Optional.
# export HBASE_CLASSPATH=

# The maximum amount of heap to use, in MB. Default is 1000.
# export HBASE_HEAPSIZE=1000

# Uncomment below if you intend to use off heap cache.
# export HBASE_OFFHEAPSIZE=1000

# For example, to allocate 8G of offheap, to 8G:
# export HBASE_OFFHEAPSIZE=8G

# Extra Java runtime options.
# Below are what we set by default. May only work with SUN JVM.
# For more on why as well as other possible settings,
# see http://wiki.apache.org/hadoop/PerformanceTuning
export HBASE_OPTS="-XX:+UseConcMarkSweepGC"

Copy these four jars from hbase-indexer-1.5-cdh5.4.1/lib to hbase-1.0.0-cdh5.4.1/lib/:

hbase-sep-api-1.5-cdh5.4.1.jar

hbase-sep-impl-1.5-hbase1.0-cdh5.4.1.jar

hbase-sep-impl-common-1.5-cdh5.4.1.jar

hbase-sep-tools-1.5-cdh5.4.1.jar

Edit hbase-1.0.0-cdh5.4.1/conf/regionservers to contain:

node2

node3

Then copy the hbase-1.0.0-cdh5.4.1 directory to node2 and node3.

IV. Deploying Solr

Just extract the Solr tarball on node1.

V. Running Everything

1. Start HBase

On node1, run:

./hbase-1.0.0-cdh5.4.1/bin/start-hbase.sh

2. Start HBase-indexer

On node2 and node3, run:

./hbase-indexer-1.5-cdh5.4.1/bin/hbase-indexer server

To run it in the background, use screen or nohup.

3. Start Solr

On node1, go into the sample subdirectory under the Solr directory and run:

java -Dbootstrap_confdir=./solr/collection1/conf -Dcollection.configName=myconf -DzkHost=node1:2181,node2:2181,node3:2181/solr -jar start.jar

Again, use screen or nohup if you want it in the background.

Visit http://node1:8983/solr/#/ to open the Solr admin page.

VI. Testing Data Indexing

With the Hadoop cluster, HBase, HBase-Indexer, and Solr all running, first create a table in HBase.

In the HBase install directory on any node, run:

./bin/hbase shell

create 'indexdemo-user', { NAME => 'info', REPLICATION_SCOPE => '1' }

On a node where HBase-Indexer is deployed, create an indexer from a field-definition file. Start from the sample configuration shipped under hbase-indexer-1.5-cdh5.4.1/demo/ (demo/user_indexer.xml), edit the field definitions as needed, and save the result as indexdemo-indexer.xml.

Then register the indexer instance. From the hbase-indexer-1.5-cdh5.4.1 directory, run:

./bin/hbase-indexer add-indexer -n myindexer -c demo/indexdemo-indexer.xml -cp \
solr.zk=node1:2181,node2:2181,node3:2181/solr -cp solr.collection=collection1 -z node1,node2,node3

Next, prepare some test data. The project called for indexing tens of millions of records, so typing inserts by hand in the shell was out of the question. HBase can also batch-execute a text file of shell commands, but that too falls flat at the ten-million scale, so in the end I wrote a Java program to bulk-insert the records quickly.

Create a new Java project in Eclipse and add everything under the HBase deployment directory's lib to the build path. The source code:

package com.hbasetest.hbtest;

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;

public class DataInput {

    private static Configuration configuration;

    static {
        configuration = HBaseConfiguration.create();
        configuration.set("hbase.zookeeper.property.clientPort", "2181");
        configuration.set("hbase.zookeeper.quorum", "node1,node2,node3");
    }

    public static void main(String[] args) {
        try {
            List<Put> putList = new ArrayList<Put>();
            HTable table = new HTable(configuration, "indexdemo-user");
            for (int i = 0; i <= 14000000; i++) {
                Put put = new Put(Integer.toString(i).getBytes());
                put.add("info".getBytes(), "firstname".getBytes(),
                        ("Java.value.firstname" + Integer.toString(i)).getBytes());
                put.add("info".getBytes(), "lastname".getBytes(),
                        ("Java.value.lastname" + Integer.toString(i)).getBytes());
                putList.add(put);
                System.out.println("queued put " + i);
            }
            table.put(putList); // single batched write of the whole list
            table.close();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}

This code collects every Put into one list and writes them in a single batch. If the machine running it is short on memory, divide and conquer: use several smaller putLists and write each one separately.
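Splitting the work into smaller batches can be sketched with plain Java. The batch size of 10,000 below is an arbitrary assumption to be tuned against the available heap; in the real program each batch would be passed to table.put and then cleared:

```java
import java.util.ArrayList;
import java.util.List;

public class BatchSplitter {
    // Arbitrary batch size; tune to the available heap.
    static final int BATCH_SIZE = 10000;

    // Splits items into consecutive batches of at most BATCH_SIZE elements.
    static <T> List<List<T>> split(List<T> items) {
        List<List<T>> batches = new ArrayList<List<T>>();
        for (int start = 0; start < items.size(); start += BATCH_SIZE) {
            int end = Math.min(start + BATCH_SIZE, items.size());
            batches.add(new ArrayList<T>(items.subList(start, end)));
        }
        return batches;
    }

    public static void main(String[] args) {
        List<Integer> rows = new ArrayList<Integer>();
        for (int i = 0; i < 25000; i++) rows.add(i);
        // 25000 rows -> batches of 10000, 10000, 5000
        List<List<Integer>> batches = split(rows);
        System.out.println(batches.size());        // 3
        System.out.println(batches.get(2).size()); // 5000
    }
}
```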

The remaining retrieval tests are straightforward, so I won't go into them here.
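One detail worth keeping in mind during retrieval: the insert program above uses Integer.toString(i) as the row key, and HBase sorts row keys as bytes, so "10" sorts before "2" in scans. Zero-padding the keys, as in this small sketch, makes lexicographic order match numeric order (the width of 8 is an assumption sized to the 14,000,000 test rows):

```java
public class RowKeys {
    // Zero-pad to a fixed width so byte-wise order matches numeric order.
    static String rowKey(int i) {
        return String.format("%08d", i);
    }

    public static void main(String[] args) {
        System.out.println(rowKey(42)); // 00000042
        // Unpadded keys sort lexicographically: "100" < "42"
        System.out.println("100".compareTo("42") < 0);             // true
        // Padded keys sort numerically: "00000042" < "00000100"
        System.out.println(rowKey(42).compareTo(rowKey(100)) < 0); // true
    }
}
```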
