Cloudera Search, Part 3: HBase Secondary Indexing

wandy0211

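Copyright notice: this is an original article by the author, licensed under CC 4.0 BY-SA. When reposting, include a link to the original source and this notice.

Article link: https://blog.csdn.net/wjandy0211/article/details/89681050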

HBase Secondary Indexing

Overview

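In HBase, the rows of a table are sorted lexicographically by RowKey, and regions are sharded at split points defined on the RowKey. The global, distributed index that results is the single biggest factor in HBase's success. However, retrieving data by RowKey alone no longer meets many requirements, and querying has become HBase's bottleneck. Users want to retrieve data as quickly as they can with SQL, but HBase was designed for storing very large tables. A query on anything other than the RowKey usually means running a full-table MapReduce job through a system such as Hive or Pig, which wastes compute resources and adds latency that is too high for interactive applications. HBase secondary indexing schemes emerged to address this.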
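The difference between a RowKey-only lookup and a secondary index can be illustrated with a small in-memory sketch. This is only a toy model, not how HBase or Solr store data: the row keys and values are made up, and only the column names are taken from the configuration later in this article.

```python
from collections import defaultdict

# Toy "table": rows addressed by RowKey. The values are hypothetical.
rows = {
    "row-001": {"cf:LOGIPAD": "10.0.0.1", "cf:LOGLGDT": "2019-04-29"},
    "row-002": {"cf:LOGIPAD": "10.0.0.2", "cf:LOGLGDT": "2019-04-29"},
    "row-003": {"cf:LOGIPAD": "10.0.0.1", "cf:LOGLGDT": "2019-04-30"},
}


def full_scan(column, value):
    """Without a secondary index: every row has to be inspected."""
    return sorted(key for key, cols in rows.items() if cols.get(column) == value)


def build_index(column):
    """Secondary index: maps a column value to the RowKeys that contain it."""
    index = defaultdict(list)
    for key in sorted(rows):
        value = rows[key].get(column)
        if value is not None:
            index[value].append(key)
    return index


ip_index = build_index("cf:LOGIPAD")
# One dictionary lookup replaces the scan over all rows.
print(ip_index["10.0.0.1"])
```

In the scheme described below, Solr plays the role of `ip_index`, and the RowKeys it returns are used to fetch the full rows from HBase.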

Solr

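Solr is a standalone enterprise search server and the open-source enterprise search platform of the Apache Lucene project. Its main features include full-text search, hit highlighting, faceted search, dynamic clustering, database integration, and handling of rich documents such as Word and PDF. Solr is highly scalable and provides distributed search and index replication. Solr 4 added NoSQL features and SolrCloud, a ZooKeeper-based distributed mode; for details on SolrCloud, see "SolrCloud distributed deployment". Other key features include efficient, flexible caching and vertical search. Solr is a high-performance full-text search server built on Lucene and originally developed on Java 5. It extends Lucene with a richer query language, is configurable and extensible, is optimized for query performance, and ships with a full-featured administration interface.

Solr can highlight search results, improves availability through index replication, provides a powerful data schema for defining fields, types, and text analysis, and offers a web-based administration interface.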

Key-Value Store Indexer

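This component is critical: it is the intermediary that builds Solr indexes from HBase data.

In CDH 5.3.2, the Key-Value Store Indexer uses the Lily HBase NRT Indexer service.

Lily HBase Indexer is a flexible, scalable, fault-tolerant, transactional, near-real-time distributed service for indexing HBase column data. It is part of the Lily system developed by NGDATA and is open source. Lily HBase Indexer stores the HBase index data in SolrCloud. When HBase performs a write, update, or delete, the Indexer uses HBase replication to turn the operation into a series of events, and uses those events to keep the index data in Solr consistent with HBase. The Indexer also supports user-defined extraction and transformation rules for indexing HBase column data. Solr search results contain the user-defined columnfamily:qualifier fields, so applications can access HBase column data directly. Indexing and searching do not affect the stability of HBase or its write throughput, because they are completely separate from the HBase write path and asynchronous. In CDH 5, Lily HBase Indexer depends on the HBase, SolrCloud, and ZooKeeper services.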
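The asynchronous, event-driven design described above can be sketched in a few lines. This is a conceptual model only: the event shape, the function names, and the dictionary standing in for a Solr collection are all hypothetical and are not part of the Lily HBase Indexer API.

```python
from collections import deque

# Hypothetical event shape: (operation, row_key, {field: value}).
events = deque()
solr_docs = {}  # stands in for a Solr collection, keyed by RowKey


def hbase_write(op, row_key, fields=None):
    """Write path: only enqueues an event and never waits for indexing."""
    events.append((op, row_key, dict(fields or {})))


def indexer_drain():
    """Indexer side: applies the queued events to the index, in order."""
    while events:
        op, row_key, fields = events.popleft()
        if op == "put":
            solr_docs.setdefault(row_key, {}).update(fields)
        elif op == "delete":
            solr_docs.pop(row_key, None)


hbase_write("put", "row-001", {"LOGIPAD": "10.0.0.1"})
hbase_write("put", "row-002", {"LOGIPAD": "10.0.0.2"})
hbase_write("delete", "row-002")
indexer_drain()
print(solr_docs)
```

Because the write path only appends to the queue, a slow or stopped indexer delays the index but does not slow down writes, which is the property the paragraph above attributes to the replication-based design.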

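I. Real-time query scheme

HBase -----> Key-Value Store Indexer ---> Solr -------> Web front end (real-time query and display)

1. HBase provides storage for massive data volumes

2. Solr provides index building and querying

3. Key-Value Store Indexer provides automated index building (from HBase to Solr)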

Procedure

Prerequisites: a CDH 5.3.2 Solr cluster and a CDH 5.3.2 Key-Value Store Indexer cluster are already set up.

1. Enable replication in HBase

2. Enable replication on the HBase table

create 'table', {NAME => 'cf', REPLICATION_SCOPE => 1}  # 1 enables replication, 0 disables it; the default is 0

For a table that already exists, use the following commands:

disable 'table'
alter 'table', {NAME => 'cf', REPLICATION_SCOPE => 1}
enable 'table'

3. Generate the instance configuration directory. /opt/cdhsolr/waslog is a custom path; choose your own.

solrctl instancedir --generate /opt/cdhsolr/waslog

4. Edit the generated schema.xml file

Add the HBase columns that need to be indexed as field nodes in schema.xml. The name attribute of each field must match the corresponding outputField value in morphlines.conf.

5. Create the collection's instance directory and upload the configuration files to ZooKeeper:

solrctl instancedir --create waslog /opt/cdhsolr/waslog

6. Once the configuration is in ZooKeeper, the other nodes can download it from there. Next, create the collection:

solrctl collection --create waslog -s 15 -r 2 -m 50
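The three numeric flags interact, so it can help to check the arithmetic before creating the collection. The sketch below assumes that -s is the number of shards, -r the replication factor, and -m the maximum number of shards per node, and that Solr needs shards × replicas to fit within nodes × max-shards-per-node. The function name is hypothetical.

```python
import math


def min_nodes(num_shards, replication_factor, max_shards_per_node):
    """Return (total cores, minimum node count) for a collection layout."""
    total_cores = num_shards * replication_factor
    return total_cores, math.ceil(total_cores / max_shards_per_node)


# Values from the command above: -s 15 -r 2 -m 50
total, nodes = min_nodes(15, 2, 50)
print(total, nodes)
```

With these values the collection has 30 cores in total, which fit on a single node as far as the -m limit is concerned. For the two replicas of a shard to be useful for availability, at least two nodes are still needed in practice.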

7. Create the Lily HBase Indexer configuration file

morphline-hbase-mapper.xml

<?xml version="1.0" encoding="UTF-8"?>
<indexer table="waslog"
         mapper="com.ngdata.hbaseindexer.morphline.MorphlineResultToSolrMapper">
  <param name="morphlineFile" value="morphlines.conf"></param>
  <param name="morphlineId" value="waslogMap"></param>
</indexer>

The value of morphlineId must match the id of the corresponding entry under morphlines in the Key-Value Store Indexer's morphlines.conf file.

8. Edit the Morphlines file. In the Key-Value Store Indexer panel, go to Configuration -> View and Edit -> Properties -> Morphlines File:

morphlines : [
  {
    id : waslogMap
    importCommands : ["org.kitesdk.**", "com.ngdata.**"]
    commands : [
      {
        extractHBaseCells {
          mappings : [
            {
              inputColumn : "cf:LOGSYFG"
              outputField : "LOGSYFG"
              type : string
              source : value
            },
            {
              inputColumn : "cf:LOGIPAD"
              outputField : "LOGIPAD"
              type : string
              source : value
            },
            {
              inputColumn : "cf:LOGSEQC"
              outputField : "LOGSEQC"
              type : string
              source : value
            },
            {
              inputColumn : "cf:LOGLGDT"
              outputField : "LOGLGDT"
              type : string
              source : value
            },
            {
              inputColumn : "cf:LOGLGTM"
              outputField : "LOGLGTM"
              type : string
              source : value
            }
          ]
        }
      }
      { logDebug { format : "output record: {}", args : ["@{}"] } }
    ]
  }
]

inputColumn: the HBase column, in columnfamily:qualifier form

outputField: the matching field configured in Solr's schema.xml
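Because every outputField has to match a field name in schema.xml, it can be convenient to generate both from a single list of columns. The following is a minimal sketch under some assumptions: the helper names are hypothetical, and the schema.xml attributes (type="string", indexed="true", stored="true") are illustrative choices, not values taken from this article.

```python
# Column qualifiers used in the morphlines.conf above.
columns = ["LOGSYFG", "LOGIPAD", "LOGSEQC", "LOGLGDT", "LOGLGTM"]


def morphline_mapping(family, qualifier):
    """Render one extractHBaseCells mapping entry."""
    return (
        "{\n"
        f'  inputColumn : "{family}:{qualifier}"\n'
        f'  outputField : "{qualifier}"\n'
        "  type : string\n"
        "  source : value\n"
        "}"
    )


def schema_field(name):
    """Render the matching schema.xml field (attributes are assumptions)."""
    return f'<field name="{name}" type="string" indexed="true" stored="true"/>'


mappings = [morphline_mapping("cf", c) for c in columns]
fields = [schema_field(c) for c in columns]
print(",\n".join(mappings))
print("\n".join(fields))
```

Generating both files from the same list removes one common source of errors, namely an outputField that has no corresponding field in schema.xml.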

9. Register the Lily HBase Indexer configuration with the Lily HBase Indexer Service

hbase-indexer add-indexer \
  --name cloudIndexer \
  --indexer-conf /opt/cdhsolr/morphline-hbase-mapper.xml \
  --connection-param solr.zk=cdh1:2181,cdh2:2181,cdh3:2181/solr \
  --connection-param solr.collection=waslog \
  --zookeeper cdh1:2181,cdh2:2181,cdh3:2181

(For example, on a cluster that uses fully qualified hostnames:

hbase-indexer add-indexer \
  --name waslogIndexer \
  --indexer-conf $HOME/morphline-hbase-mapper.xml \
  --connection-param solr.zk=web04.cloud.9ffox.com,web01.cloud.9ffox.com,web02.cloud.9ffox.com/solr \
  --connection-param solr.collection=waslog \
  --zookeeper web04.cloud.9ffox.com:2181,web01.cloud.9ffox.com:2181,web02.cloud.9ffox.com:2181

)

Verify that the indexer was created successfully:

hbase-indexer list-indexers

(hbase-indexer list-indexers --zookeeper web04.cloud.9ffox.com:2181,web01.cloud.9ffox.com:2181,web02.cloud.9ffox.com:2181)

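10. Test by putting data and checking the result

A few seconds after writing data to HBase, the inserted data should be searchable in the corresponding Solr collection, which shows that the configuration works.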
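One way to check the result is to query the collection through Solr's HTTP select handler. The sketch below only builds the request URL. The host and port (cdh1:8983), the field value, and the helper name are assumptions for illustration; adjust them to your cluster.

```python
from urllib.parse import urlencode


def solr_select_url(base_url, collection, query, rows=10):
    """Build a select URL for a Solr collection (JSON response)."""
    params = urlencode({"q": query, "rows": rows, "wt": "json"})
    return f"{base_url}/{collection}/select?{params}"


# Hypothetical host and value; "waslog" and "LOGIPAD" come from the steps above.
url = solr_select_url("http://cdh1:8983/solr", "waslog", "LOGIPAD:10.0.0.1")
print(url)
```

The URL can be opened in a browser or fetched with `urllib.request.urlopen(url)`. A document for the row that was just written should appear in the response once the indexer has processed the event.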

11. Use the IK analyzer

Create a classes directory under /opt/cloudera/parcels/CDH/lib/solr/webapps/solr/WEB-INF.

Add IKAnalyzer.cfg.xml and stopword.dic to the classes directory.

Add IKAnalyzer2012FF_u1.jar to the /opt/cloudera/parcels/CDH/lib/solr/webapps/solr/WEB-INF/lib directory.

Add the following to schema.xml:

<fieldType name="text_ik" class="solr.TextField">
  <analyzer type="index" isMaxWordLength="false"
            class="org.wltea.analyzer.lucene.IKAnalyzer"/>
  <analyzer type="query" isMaxWordLength="true"
            class="org.wltea.analyzer.lucene.IKAnalyzer"/>
</fieldType>

After configuring, update the configuration files in ZooKeeper and restart the Solr service.

12. Additional commands

After adding new index fields to schema.xml, run the following commands to update the configuration:

solrctl instancedir --update waslog /opt/cdhsolr/waslog
solrctl collection --reload waslog

To list collections: solrctl collection --list

Migrating HBase table data to the Solr cluster

In CDH 5.3.2, hbase-indexer provides a MapReduce job for building indexes in batch:

/opt/cloudera/parcels/CDH-5.3.2-1.cdh5.3.2.p0.10/lib/hbase-solr/tools/hbase-indexer-mr-1.5-cdh5.3.2-job.jar

Build command:

hadoop jar /opt/cloudera/parcels/CDH-5.3.2-1.cdh5.3.2.p0.10/lib/hbase-solr/tools/hbase-indexer-mr-1.5-cdh5.3.2-job.jar \
  -D 'mapreduce.reduce.shuffle.memory.limit.percent=0.06' \
  --hbase-indexer-file /opt/cdhsolr/mapping/waslog/morphline-hbase-mapper.xml \
  --zk-host hadoop03:2181,hadoop04:2181,hadoop05:2181/solr \
  --collection waslog \
  --go-live

Note: a morphlines.conf file must exist in the directory where the command is run.

hadoop --config /etc/hadoop/conf \
  jar /app/cloudera/parcels/CDH/lib/hbase-solr/tools/hbase-indexer-mr-1.5-cdh6.0.0-job.jar \
  --conf /etc/hbase/conf/hbase-site.xml -D 'mapred.child.java.opts=-Xmx1024m' \
  --hbase-indexer-file /home/appuser/morphline-hbase-mapper.xml \
  --zk-host web04.cloud.9ffox.com:2181,web01.cloud.9ffox.com:2181,web02.cloud.9ffox.com:2181/solr --collection waslog \
  --go-live

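Note: diagnosing errors in historical data synchronization

Log in to Cloudera Manager.

Go to Clusters --> YARN --> Web UI --> ResourceManager Web UI (dbp01).

Look up the job by its ID to see the detailed error information for the failure.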

hbase-indexer add-indexer \
  --name tob_fire_scoreIndexer \
  --indexer-conf /home/appuser/tob_fire_score/morphline-hbase-mapper.xml \
  --connection-param solr.zk=web04.cloud.9ffox.com,web01.cloud.9ffox.com,web02.cloud.9ffox.com/solr \
  --connection-param solr.collection=tob_fire_score \
  --zookeeper web04.cloud.9ffox.com:2181,web01.cloud.9ffox.com:2181,web02.cloud.9ffox.com:2181
