1. Requirements Analysis
HBase's query implementation offers only two access paths:
1. Fetch a single row by a given RowKey, via the get method (org.apache.hadoop.hbase.client.Get)
2. Fetch a batch of rows matching given criteria, via the scan method (org.apache.hadoop.hbase.client.Scan)
The first step to using HBase well is a good RowKey design. Queries over large data volumes should be driven by the RowKey: value-based filters such as ColumnValueFilter are slow, and HBase's query speed ultimately rests on the RowKey. Design the RowKey around your business logic so that every query can go through it, and those queries will be very fast. For batch queries, it is best to use a scan bounded by a start key and an end key.
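For reference, here is a minimal sketch of both access paths using the HBase 2.x Java client; the table name and RowKeys are illustrative assumptions:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;

public class RowKeyAccessDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("article"))) { // table name is an assumption
            // 1. Point lookup: fetch exactly one row by its RowKey
            Result row = table.get(new Get(Bytes.toBytes("rowkey-0001")));
            System.out.println(row);
            // 2. Range scan: fetch a batch of rows between a start key and a stop key
            Scan scan = new Scan()
                    .withStartRow(Bytes.toBytes("rowkey-0001"))
                    .withStopRow(Bytes.toBytes("rowkey-0100"));
            try (ResultScanner scanner = table.getScanner(scan)) {
                for (Result r : scanner) {
                    System.out.println(Bytes.toString(r.getRow()));
                }
            }
        }
    }
}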
Options:
1. Since 0.92, HBase has shipped coprocessors, a set of hooks that make it straightforward to implement features such as access control and secondary indexes.
2. Apache Phoenix: built around SQL on HBase and compatible with multiple HBase versions; secondary indexing is just one of its features. Index creation and management are supported directly in SQL syntax, which makes them very easy to use, and the project currently has an active community and a healthy release cadence.
3. A common approach is to use Elasticsearch (hereafter ES) or Apache Solr, both built on Apache Lucene underneath, to provide strong indexing and search capabilities, e.g. fuzzy matching, full-text retrieval, compound queries, and sorting.
As for secondary indexes built externally, companies with their own big-data teams usually optimize for their specific workloads and run their own ES/Solr search clusters. For example, Datastory's internal full corpus of tens of billions of records builds its massive indexing and retrieval capability on ES.
The HBase + ES Solution
Below are the two pipelines Datastory uses to build its secondary indexes on ES:
Incremental indexing: data sources that keep arriving day to day drive incremental index updates
Full indexing: a companion Spark/MR batch program creates or rebuilds indexes, used for the initial load or for re-indexing an existing HBase table
Put differently, three indexing scenarios need to be covered (a batch-indexing sketch follows this list):
- Batch indexing: HBase already holds a large volume of data for which an ES index must be built;
- Incremental indexing: given the RowKeys of data already in HBase, incrementally index those rows into ES;
- Real-time indexing: data keeps flowing into an HBase table, and its ES index must be updated in real time;
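As a rough illustration of the batch path, here is a minimal single-process sketch that scans an existing HBase table and bulk-indexes every row into ES, using the RowKey as the document ID. The table name, index/type names, and batch size are assumptions, and it reuses the ESClient helper shown later under Scheme 2; a production job would shard this work across Spark/MR tasks rather than one client:
import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.hbase.Cell;
import org.apache.hadoop.hbase.CellUtil;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;
import org.elasticsearch.action.bulk.BulkRequestBuilder;
import org.elasticsearch.client.Client;

public class BatchIndexBuilder {
    public static void main(String[] args) throws Exception {
        ESClient.initEsClient();
        Client client = ESClient.client;
        Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
        Table table = conn.getTable(TableName.valueOf("gejx_test")); // table name is an assumption
        BulkRequestBuilder bulk = client.prepareBulk();
        try (ResultScanner scanner = table.getScanner(new Scan())) {
            for (Result r : scanner) {
                // Flatten each row's cells into qualifier -> value pairs
                Map<String, Object> doc = new HashMap<>();
                for (Cell cell : r.rawCells()) {
                    doc.put(Bytes.toString(CellUtil.cloneQualifier(cell)),
                            Bytes.toString(CellUtil.cloneValue(cell)));
                }
                // The RowKey becomes the ES document ID
                bulk.add(client.prepareIndex("gejx_test", "dmp_ods", Bytes.toString(r.getRow()))
                        .setSource(doc));
                if (bulk.numberOfActions() >= 1000) { // flush in batches (size is an assumption)
                    bulk.get();
                    bulk = client.prepareBulk();
                }
            }
        }
        if (bulk.numberOfActions() > 0) {
            bulk.get(); // flush the tail of the last batch
        }
        table.close();
        conn.close();
        ESClient.closeEsClient();
    }
}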
The coprocessor approach is to use the RowKey as the ID of the indexed document and to index the columns to be queried into ES.
Data query flow: the user supplies a column value as the search condition; ES resolves it to the matching RowKeys, and the complete rows are then fetched from HBase by RowKey.
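A hedged sketch of this read path, combining an ES search with an HBase batch get; the index name and page size are assumptions:
import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;
import org.elasticsearch.action.search.SearchResponse;
import org.elasticsearch.client.Client;
import org.elasticsearch.index.query.QueryBuilders;
import org.elasticsearch.search.SearchHit;

public class SecondaryIndexQuery {
    /** Resolve a column value to RowKeys via ES, then fetch the full rows from HBase. */
    public static List<Result> queryByColumn(Client client, Table table,
                                             String column, String value) throws IOException {
        SearchResponse resp = client.prepareSearch("article") // index name is an assumption
                .setQuery(QueryBuilders.termQuery(column, value))
                .setSize(100) // first page only, for brevity
                .get();
        List<Get> gets = new ArrayList<Get>();
        for (SearchHit hit : resp.getHits().getHits()) {
            gets.add(new Get(Bytes.toBytes(hit.getId()))); // document ID == RowKey
        }
        return Arrays.asList(table.get(gets)); // batch point-lookups by RowKey
    }
}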
In building its full corpus, Datastory still ran into plenty of problems to solve, such as data consistency, large numbers of small indices, and multiple ES cluster versions coexisting; these will be covered in more detail in follow-up posts.
2. Solutions
Scheme 1:
For workloads with demanding write performance, write the data to HBase first and write the index fields to ES in a separate flow; keeping the two write paths independent maximizes throughput. A provincial public security bureau currently runs this scheme, writing 20 billion records (about 6 TB) per day with roughly 20 indexed fields per record.
Drawback: the two stores can become inconsistent.
· On each new Put, convert the Put into JSON and index it into Elasticsearch, using the RowKey as the new document's ID
· On each new Delete, take the RowKey of the deleted row and delete the document with that ID from Elasticsearch
Scheme 2:
This is the approach most often seen online: use an HBase coprocessor to listen for data changes in HBase and update the ES index in real time.
Drawback: the coprocessor adds load to HBase and can hurt its performance.
Scheme 1 code:
articles.json
{
  "settings": {
    "number_of_shards": 3,
    "number_of_replicas": 1
  },
  "mappings": {
    "article": {
      "dynamic": "strict",
      "properties": {
        "id": {"type": "integer", "store": "yes"},
        "title": {"type": "string", "store": "yes", "index": "analyzed", "analyzer": "ik"},
        "describe": {"type": "string", "store": "yes", "index": "analyzed", "analyzer": "ik"},
        "author": {"type": "string", "store": "yes", "index": "no"}
      }
    }
  }
}
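For completeness, one hedged way to apply this mapping when creating the index, via the same client used in the code below (on newer ES clients, setSource additionally requires an explicit XContentType argument):
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

// Create the "articles" index from the JSON file above (file path is an assumption)
String mappingJson = new String(Files.readAllBytes(Paths.get("articles.json")), StandardCharsets.UTF_8);
EsUtil.getInstance().admin().indices()
        .prepareCreate("articles")
        .setSource(mappingJson)
        .get();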
import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.InputStreamReader;
import java.util.ArrayList;
import java.util.List;
import org.apache.commons.lang.StringUtils;

public class DataImportHBaseAndIndex {
    public static final String FILE_PATH = "D:/bigdata/es_hbase/datasrc/article.txt";

    public static void main(String[] args) throws Exception {
        // Read the data source: one tab-separated article per line
        List<Article> articleList = new ArrayList<Article>();
        BufferedReader bufferedReader = new BufferedReader(
                new InputStreamReader(new FileInputStream(new File(FILE_PATH)), "UTF-8"));
        try {
            String line;
            while ((line = bufferedReader.readLine()) != null) {
                String[] split = StringUtils.split(line, "\t");
                Article article = new Article();
                article.setId(Integer.valueOf(split[0]));
                article.setTitle(split[1]);
                article.setAuthor(split[2]);
                article.setDescribe(split[3]);
                article.setContent(split[3]);
                articleList.add(article);
            }
        } finally {
            bufferedReader.close();
        }
        HBaseUtils hBaseUtils = new HBaseUtils(); // one shared instance, not one per record
        for (Article a : articleList) {
            String rowKey = String.valueOf(a.getId());
            // Insert the record into HBase, one column per field
            hBaseUtils.put(HBaseUtils.TABLE_NAME, rowKey, HBaseUtils.COLUMNFAMILY_1,
                    HBaseUtils.COLUMNFAMILY_1_TITLE, a.getTitle());
            hBaseUtils.put(HBaseUtils.TABLE_NAME, rowKey, HBaseUtils.COLUMNFAMILY_1,
                    HBaseUtils.COLUMNFAMILY_1_AUTHOR, a.getAuthor());
            hBaseUtils.put(HBaseUtils.TABLE_NAME, rowKey, HBaseUtils.COLUMNFAMILY_1,
                    HBaseUtils.COLUMNFAMILY_1_DESCRIBE, a.getDescribe());
            hBaseUtils.put(HBaseUtils.TABLE_NAME, rowKey, HBaseUtils.COLUMNFAMILY_1,
                    HBaseUtils.COLUMNFAMILY_1_CONTENT, a.getContent());
            // Insert the record into Elasticsearch as well
            EsUtil.addIndex(EsUtil.DEFAULT_INDEX, EsUtil.DEFAULT_TYPE, a);
        }
    }
}
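HBaseUtils is referenced above but not shown in the original; a minimal sketch consistent with the calls above, with constants and column names inferred from usage (all assumptions), might look like this:
import java.io.IOException;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseUtils {
    public static final String TABLE_NAME = "article";
    public static final String COLUMNFAMILY_1 = "cf";
    public static final String COLUMNFAMILY_1_TITLE = "title";
    public static final String COLUMNFAMILY_1_AUTHOR = "author";
    public static final String COLUMNFAMILY_1_DESCRIBE = "describe";
    public static final String COLUMNFAMILY_1_CONTENT = "content";

    private final Connection connection;

    public HBaseUtils() throws IOException {
        this.connection = ConnectionFactory.createConnection(HBaseConfiguration.create());
    }

    /** Write a single cell: table / rowKey / family / qualifier / value. */
    public void put(String tableName, String rowKey, String family,
                    String qualifier, String value) throws IOException {
        try (Table table = connection.getTable(TableName.valueOf(tableName))) {
            Put p = new Put(Bytes.toBytes(rowKey));
            p.addColumn(Bytes.toBytes(family), Bytes.toBytes(qualifier), Bytes.toBytes(value));
            table.put(p);
        }
    }
}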
/**
 * Add a document to the Elasticsearch index
 * @param index index name
 * @param type mapping type
 * @param article the record to index
 * @return the ID of the indexed document
 */
public static String addIndex(String index, String type, Article article) {
    HashMap<String, Object> hashMap = new HashMap<String, Object>();
    hashMap.put("id", article.getId());
    hashMap.put("title", article.getTitle());
    hashMap.put("describe", article.getDescribe());
    hashMap.put("author", article.getAuthor());
    // Use the article ID (also the HBase RowKey) as the ES document ID
    IndexResponse response = getInstance()
            .prepareIndex(index, type, String.valueOf(article.getId()))
            .setSource(hashMap).get();
    return response.getId();
}
/**
 * Search Elasticsearch
 * @param skey search keyword
 * @param index index name
 * @param type mapping type
 * @param start offset of the first hit
 * @param row maximum number of hits per page
 * @return the result records plus the total hit count
 */
public static Map<String, Object> search(String skey, String index, String type, Integer start, Integer row) {
    HashMap<String, Object> dataMap = new HashMap<String, Object>();
    ArrayList<Map<String, Object>> dataList = new ArrayList<Map<String, Object>>();
    SearchRequestBuilder builder = EsUtil.getInstance().prepareSearch(index);
    builder.setTypes(type);
    builder.setSearchType(SearchType.DFS_QUERY_THEN_FETCH);
    if (StringUtils.isNotBlank(skey)) {
        // Match against both the title and describe fields
        builder.setQuery(QueryBuilders.multiMatchQuery(skey, "title", "describe"));
    }
    builder.setFrom(start);
    builder.setSize(row);
    // Configure highlighting for the matched terms
    builder.addHighlightedField("title");
    builder.addHighlightedField("describe");
    builder.setHighlighterPreTags("<font color='red'>");
    builder.setHighlighterPostTags("</font>");
    SearchResponse response = builder.get(); // execute the query
    SearchHits hits = response.getHits();
    long totalCount = hits.getTotalHits(); // total number of hits
    for (SearchHit searchHit : hits.getHits()) {
        Map<String, Object> source = searchHit.getSource(); // the stored document
        // Replace title and describe with their highlighted fragments, if any
        Map<String, HighlightField> highlightFields = searchHit.getHighlightFields();
        HighlightField highlightFieldTitle = highlightFields.get("title");
        if (highlightFieldTitle != null) {
            StringBuilder name = new StringBuilder();
            for (Text text : highlightFieldTitle.getFragments()) {
                name.append(text);
            }
            source.put("title", name.toString()); // highlighted title
        }
        HighlightField highlightFieldDescribe = highlightFields.get("describe");
        if (highlightFieldDescribe != null) {
            StringBuilder name = new StringBuilder();
            for (Text text : highlightFieldDescribe.getFragments()) {
                name.append(text);
            }
            source.put("describe", name.toString()); // highlighted describe
        }
        dataList.add(source);
    }
    dataMap.put("count", totalCount);
    dataMap.put("dataList", dataList);
    return dataMap;
}
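The EsUtil.getInstance() used above is also not shown in the original; it presumably hands back a singleton TransportClient. A sketch matching the ES 2.x-era API used in this scheme, with the cluster name, address, and index/type constants all assumed:
import java.net.InetAddress;
import org.elasticsearch.client.Client;
import org.elasticsearch.client.transport.TransportClient;
import org.elasticsearch.common.settings.Settings;
import org.elasticsearch.common.transport.InetSocketTransportAddress;

public class EsUtil {
    public static final String DEFAULT_INDEX = "articles";
    public static final String DEFAULT_TYPE = "article";
    private static Client client;

    /** Lazily build a singleton TransportClient (ES 2.x-style bootstrap). */
    public static synchronized Client getInstance() {
        if (client == null) {
            try {
                Settings settings = Settings.settingsBuilder()
                        .put("cluster.name", "elasticsearch").build();
                client = TransportClient.builder().settings(settings).build()
                        .addTransportAddress(new InetSocketTransportAddress(
                                InetAddress.getByName("localhost"), 9300));
            } catch (Exception e) {
                throw new RuntimeException(e);
            }
        }
        return client;
    }
    // addIndex(...) and search(...) shown above would live in this class as well
}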
Scheme 2 code: https://segmentfault.com/a/1190000018071516
HbaseDataSyncEsObserver.java (skeleton for verifying that the coprocessor hooks fire)
public class HbaseDataSyncEsObserver implements RegionObserver, RegionCoprocessor {
    private static final Logger LOG = Logger.getLogger(HbaseDataSyncEsObserver.class);

    // Required in HBase 2.x so the framework can obtain the RegionObserver
    @Override
    public Optional<RegionObserver> getRegionObserver() {
        return Optional.of(this);
    }

    @Override
    public void start(CoprocessorEnvironment env) throws IOException {
        LOG.info("====Test Start====");
    }

    @Override
    public void stop(CoprocessorEnvironment env) throws IOException {
        LOG.info("====Test End====");
    }

    @Override
    public void postPut(ObserverContext<RegionCoprocessorEnvironment> e, Put put, WALEdit edit, Durability durability) throws IOException {
        LOG.info("====Test postPut====");
    }

    @Override
    public void postDelete(ObserverContext<RegionCoprocessorEnvironment> e, Delete delete, WALEdit edit, Durability durability) throws IOException {
        LOG.info("====Test postDelete====");
    }
}
ElasticSearchBulkOperator.java
public class ElasticSearchBulkOperator {
    private static final Log LOG = LogFactory.getLog(ElasticSearchBulkOperator.class);
    private static final int MAX_BULK_COUNT = 10000;
    private static BulkRequestBuilder bulkRequestBuilder = null;
    private static final Lock commitLock = new ReentrantLock();
    private static ScheduledExecutorService scheduledExecutorService = null;

    static {
        // init the ES bulkRequestBuilder
        bulkRequestBuilder = ESClient.client.prepareBulk();
        bulkRequestBuilder.setRefreshPolicy(WriteRequest.RefreshPolicy.IMMEDIATE);
        // init a thread pool of size 1
        scheduledExecutorService = Executors.newScheduledThreadPool(1);
        // background task that periodically syncs buffered requests to the ES cluster;
        // commitLock keeps bulk access thread-safe
        final Runnable beeper = () -> {
            commitLock.lock();
            try {
                // flush whatever has accumulated, regardless of the size threshold
                bulkRequest(0);
            } catch (Exception ex) {
                LOG.error("scheduled bulk flush error: " + ex.getMessage());
            } finally {
                commitLock.unlock();
            }
        };
        // schedule the flush task: 10-second initial delay, then every 30 seconds
        scheduledExecutorService.scheduleAtFixedRate(beeper, 10, 30, TimeUnit.SECONDS);
    }

    public static void shutdownScheduEx() {
        if (null != scheduledExecutorService && !scheduledExecutorService.isShutdown()) {
            scheduledExecutorService.shutdown();
        }
    }

    private static void bulkRequest(int threshold) {
        if (bulkRequestBuilder.numberOfActions() > threshold) {
            BulkResponse bulkItemResponse = bulkRequestBuilder.execute().actionGet();
            // only reset the builder on success; on failure the batch stays queued for retry
            if (!bulkItemResponse.hasFailures()) {
                bulkRequestBuilder = ESClient.client.prepareBulk();
            }
        }
    }

    /**
     * Add an update request to the bulk buffer,
     * using commitLock to keep bulk access thread-safe.
     * @param builder the update request
     */
    public static void addUpdateBuilderToBulk(UpdateRequestBuilder builder) {
        commitLock.lock();
        try {
            bulkRequestBuilder.add(builder);
            bulkRequest(MAX_BULK_COUNT);
        } catch (Exception ex) {
            LOG.error("update bulk " + "gejx_test" + " index error : " + ex.getMessage());
        } finally {
            commitLock.unlock();
        }
    }

    /**
     * Add a delete request to the bulk buffer,
     * using commitLock to keep bulk access thread-safe.
     * @param builder the delete request
     */
    public static void addDeleteBuilderToBulk(DeleteRequestBuilder builder) {
        commitLock.lock();
        try {
            bulkRequestBuilder.add(builder);
            bulkRequest(MAX_BULK_COUNT);
        } catch (Exception ex) {
            LOG.error("delete bulk " + "gejx_test" + " index error : " + ex.getMessage());
        } finally {
            commitLock.unlock();
        }
    }
}
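Two knobs control flushing in this helper: the size threshold MAX_BULK_COUNT, checked on every add, and the scheduled task, which calls bulkRequest(0) every 30 seconds to push out whatever has accumulated. Note that when a bulk response reports failures, the builder is not reset, so the whole batch (including actions that already succeeded) is re-sent on the next flush; that retry-everything behavior is tolerable here because the requests are idempotent upserts and deletes.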
ESClient.java
public class ESClient {
    public static Client client;

    /**
     * Init the ES client
     */
    public static void initEsClient() throws UnknownHostException {
        System.setProperty("es.set.netty.runtime.available.processors", "false");
        // cluster.name must match the name of the target ES cluster
        Settings esSettings = Settings.builder().put("cluster.name", "elasticsearch").build();
        client = new PreBuiltTransportClient(esSettings)
                .addTransportAddress(new TransportAddress(InetAddress.getByName("localhost"), 9300));
    }

    /**
     * Close the ES client
     */
    public static void closeEsClient() {
        client.close();
    }
}
HbaseDataSyncEsObserver.java (the full version)
public class HbaseDataSyncEsObserver implements RegionObserver, RegionCoprocessor {
    private static final Logger LOG = Logger.getLogger(HbaseDataSyncEsObserver.class);

    @Override
    public Optional<RegionObserver> getRegionObserver() {
        return Optional.of(this);
    }

    @Override
    public void start(CoprocessorEnvironment env) throws IOException {
        // init the ES client when the coprocessor is loaded
        ESClient.initEsClient();
        LOG.info("****init start*****");
    }

    @Override
    public void stop(CoprocessorEnvironment env) throws IOException {
        ESClient.closeEsClient();
        // shut down the scheduled flush task
        ElasticSearchBulkOperator.shutdownScheduEx();
        LOG.info("****end*****");
    }

    @Override
    public void postPut(ObserverContext<RegionCoprocessorEnvironment> e, Put put, WALEdit edit, Durability durability) throws IOException {
        String indexId = Bytes.toString(put.getRow());
        try {
            // flatten every cell of the Put into qualifier -> value pairs
            NavigableMap<byte[], List<Cell>> familyMap = put.getFamilyCellMap();
            Map<String, Object> json = new HashMap<>();
            for (Map.Entry<byte[], List<Cell>> entry : familyMap.entrySet()) {
                for (Cell cell : entry.getValue()) {
                    String key = Bytes.toString(CellUtil.cloneQualifier(cell));
                    String value = Bytes.toString(CellUtil.cloneValue(cell));
                    json.put(key, value);
                }
            }
            // upsert the document so both inserts and updates are covered
            ElasticSearchBulkOperator.addUpdateBuilderToBulk(
                    ESClient.client.prepareUpdate("gejx_test", "dmp_ods", indexId).setDocAsUpsert(true).setDoc(json));
            LOG.info("**** postPut success *****");
        } catch (Exception ex) {
            LOG.error("observer put a doc, index [ " + "gejx_test" + " ] indexId [" + indexId + "] error : " + ex.getMessage());
        }
    }

    @Override
    public void postDelete(ObserverContext<RegionCoprocessorEnvironment> e, Delete delete, WALEdit edit, Durability durability) throws IOException {
        String indexId = Bytes.toString(delete.getRow());
        try {
            ElasticSearchBulkOperator.addDeleteBuilderToBulk(
                    ESClient.client.prepareDelete("gejx_test", "dmp_ods", indexId));
            LOG.info("**** postDelete success *****");
        } catch (Exception ex) {
            LOG.error("observer delete a doc, index [ " + "gejx_test" + " ] indexId [" + indexId + "] error : " + ex.getMessage());
        }
    }
}
The jar must be uploaded to HDFS, and the hbase user must be granted permission on it; in my tests I uploaded it under /apps/hbase (an HDP environment).
# Create the test table
create 'gejx_test', 'cf'
# Disable the table
disable 'gejx_test'
# Attach the coprocessor to the table
alter 'gejx_test', METHOD => 'table_att', 'coprocessor' => 'hdfs://dev-dmp2.fengdai.org:8020/apps/hbase/hbase-observer-simple-example.jar|com.tairanchina.csp.dmp.examples.HbaseDataSyncEsObserver|1073741823'
# Re-enable the table
enable 'gejx_test'
# Inspect the table
desc 'gejx_test'
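If the observer later needs to be replaced or removed (for example after redeploying the jar), the attachment can be undone with table_att_unset; the attribute name below assumes this coprocessor was the first one attached to the table:
# Detach the coprocessor (disable/enable the table around the change)
disable 'gejx_test'
alter 'gejx_test', METHOD => 'table_att_unset', NAME => 'coprocessor$1'
enable 'gejx_test'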
3. Query Speed Test
Trajectory playback stresses performance in two ways: aggregate statistical queries and real-time responsiveness. Figure 8 shows the measured query efficiency for different result-set sizes against a PB-scale base dataset.
For small result sets, responses take over 10 s without the index but under 1 s with it, a speedup of roughly 20x. For large result sets (on the order of tens of thousands of rows), queries run 9 to 10 times faster, a major gain in real-time query efficiency and speed. In practice, a 20-minute trajectory playback is about 600 records and can be returned within 5 s; since the response can be segmented, issuing four queries over 5-minute windows keeps the page rendering fairly smooth.