1. Requirements Analysis
HBase's query implementation offers only two access paths:
1. Fetch a single row by a given RowKey, via the get method (org.apache.hadoop.hbase.client.Get)
2. Fetch a batch of rows matching given criteria, via the scan method (org.apache.hadoop.hbase.client.Scan)
The first step to using HBase well is a good RowKey design. Queries over large data volumes should be driven by the RowKey: value-based filters such as ColumnValueFilter are slow, and HBase's query speed ultimately rests on the RowKey. Design the RowKey around your business logic so that every query can go through it, and those queries will be very fast. For batch queries, it is best to use a scan bounded by a start key and an end key.
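For reference, here is a minimal sketch of both access paths using the HBase 2.x Java client; the table name and RowKeys are illustrative assumptions:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;

public class RowKeyAccessDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("article"))) { // table name is an assumption
            // 1. Point lookup: fetch exactly one row by its RowKey
            Result row = table.get(new Get(Bytes.toBytes("rowkey-0001")));
            System.out.println(row);
            // 2. Range scan: fetch a batch of rows between a start key and a stop key
            Scan scan = new Scan()
                    .withStartRow(Bytes.toBytes("rowkey-0001"))
                    .withStopRow(Bytes.toBytes("rowkey-0100"));
            try (ResultScanner scanner = table.getScanner(scan)) {
                for (Result r : scanner) {
                    System.out.println(Bytes.toString(r.getRow()));
                }
            }
        }
    }
}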
Options:
1. Since 0.92, HBase has shipped coprocessors, a set of hooks that make it straightforward to implement features such as access control and secondary indexes.
2. Apache Phoenix: built around SQL on HBase and compatible with multiple HBase versions; secondary indexing is just one of its features. Index creation and management are supported directly in SQL syntax, which makes them very easy to use, and the project currently has an active community and a healthy release cadence.
3. A common approach is to use Elasticsearch (hereafter ES) or Apache Solr, both built on Apache Lucene underneath, to provide strong indexing and search capabilities, e.g. fuzzy matching, full-text retrieval, compound queries, and sorting.
As for secondary indexes built externally, companies with their own big-data teams usually optimize for their specific workloads and run their own ES/Solr search clusters. For example, Datastory's internal full corpus of tens of billions of records builds its massive indexing and retrieval capability on ES.
The HBase + ES Solution
Below are the two pipelines Datastory uses to build its secondary indexes on ES:
Incremental indexing: data sources that keep arriving day to day drive incremental index updates
Full indexing: a companion Spark/MR batch program creates or rebuilds indexes, used for the initial load or for re-indexing an existing HBase table
Put differently, three indexing scenarios need to be covered (a batch-indexing sketch follows this list):
- Batch indexing: HBase already holds a large volume of data for which an ES index must be built;
- Incremental indexing: given the RowKeys of data already in HBase, incrementally index those rows into ES;
- Real-time indexing: data keeps flowing into an HBase table, and its ES index must be updated in real time;
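As a rough illustration of the batch path, here is a minimal single-process sketch that scans an existing HBase table and bulk-indexes every row into ES, using the RowKey as the document ID. The table name, index/type names, and batch size are assumptions, and it reuses the ESClient helper shown later under Scheme 2; a production job would shard this work across Spark/MR tasks rather than one client:
import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.hbase.Cell;
import org.apache.hadoop.hbase.CellUtil;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;
import org.elasticsearch.action.bulk.BulkRequestBuilder;
import org.elasticsearch.client.Client;

public class BatchIndexBuilder {
    public static void main(String[] args) throws Exception {
        ESClient.initEsClient();
        Client client = ESClient.client;
        Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
        Table table = conn.getTable(TableName.valueOf("gejx_test")); // table name is an assumption
        BulkRequestBuilder bulk = client.prepareBulk();
        try (ResultScanner scanner = table.getScanner(new Scan())) {
            for (Result r : scanner) {
                // Flatten each row's cells into qualifier -> value pairs
                Map<String, Object> doc = new HashMap<>();
                for (Cell cell : r.rawCells()) {
                    doc.put(Bytes.toString(CellUtil.cloneQualifier(cell)),
                            Bytes.toString(CellUtil.cloneValue(cell)));
                }
                // The RowKey becomes the ES document ID
                bulk.add(client.prepareIndex("gejx_test", "dmp_ods", Bytes.toString(r.getRow()))
                        .setSource(doc));
                if (bulk.numberOfActions() >= 1000) { // flush in batches (size is an assumption)
                    bulk.get();
                    bulk = client.prepareBulk();
                }
            }
        }
        if (bulk.numberOfActions() > 0) {
            bulk.get(); // flush the tail of the last batch
        }
        table.close();
        conn.close();
        ESClient.closeEsClient();
    }
}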
The coprocessor approach is to use the RowKey as the ID of the indexed document and to index the columns to be queried into ES.
Data query flow: the user supplies a column value as the search condition; ES resolves it to the matching RowKeys, and the complete rows are then fetched from HBase by RowKey.
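A hedged sketch of this read path, combining an ES search with an HBase batch get; the index name and page size are assumptions:
import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;
import org.elasticsearch.action.search.SearchResponse;
import org.elasticsearch.client.Client;
import org.elasticsearch.index.query.QueryBuilders;
import org.elasticsearch.search.SearchHit;

public class SecondaryIndexQuery {
    /** Resolve a column value to RowKeys via ES, then fetch the full rows from HBase. */
    public static List<Result> queryByColumn(Client client, Table table,
                                             String column, String value) throws IOException {
        SearchResponse resp = client.prepareSearch("article") // index name is an assumption
                .setQuery(QueryBuilders.termQuery(column, value))
                .setSize(100) // first page only, for brevity
                .get();
        List<Get> gets = new ArrayList<Get>();
        for (SearchHit hit : resp.getHits().getHits()) {
            gets.add(new Get(Bytes.toBytes(hit.getId()))); // document ID == RowKey
        }
        return Arrays.asList(table.get(gets)); // batch point-lookups by RowKey
    }
}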
In building its full corpus, Datastory still ran into plenty of problems to solve, such as data consistency, large numbers of small indices, and multiple ES cluster versions coexisting; these will be covered in more detail in follow-up posts.
2. Solutions
Scheme 1:
For workloads with demanding write performance, write the data to HBase first and write the index fields to ES in a separate flow; keeping the two write paths independent maximizes throughput. A provincial public security bureau currently runs this scheme, writing 20 billion records (about 6 TB) per day with roughly 20 indexed fields per record.
Drawback: the two stores can become inconsistent.
· On each new Put, convert the Put into JSON and index it into Elasticsearch, using the RowKey as the new document's ID
· On each new Delete, take the RowKey of the deleted row and delete the document with that ID from Elasticsearch
Scheme 2:
This is the approach most often seen online: use an HBase coprocessor to listen for data changes in HBase and update the ES index in real time.
Drawback: the coprocessor adds load to HBase and can hurt its performance.
Scheme 1 code:
articles.json
{
  "settings": {
    "number_of_shards": 3,
    "number_of_replicas": 1
  },
  "mappings": {
    "article": {
      "dynamic": "strict",
      "properties": {
        "id": {"type": "integer", "store": "yes"},
        "title": {"type": "string", "store": "yes", "index": "analyzed", "analyzer": "ik"},
        "describe": {"type": "string", "store": "yes", "index": "analyzed", "analyzer": "ik"},
        "author": {"type": "string", "store": "yes", "index": "no"}
      }
    }
  }
}
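For completeness, one hedged way to apply this mapping when creating the index, via the same client used in the code below (on newer ES clients, setSource additionally requires an explicit XContentType argument):
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

// Create the "articles" index from the JSON file above (file path is an assumption)
String mappingJson = new String(Files.readAllBytes(Paths.get("articles.json")), StandardCharsets.UTF_8);
EsUtil.getInstance().admin().indices()
        .prepareCreate("articles")
        .setSource(mappingJson)
        .get();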
import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.InputStreamReader;
import java.util.ArrayList;
import java.util.List;
import org.apache.commons.lang.StringUtils;

public class DataImportHBaseAndIndex {
    public static final String FILE_PATH = "D:/bigdata/es_hbase/datasrc/article.txt";

    public static void main(String[] args) throws Exception {
        // Read the data source: one tab-separated article per line
        List<Article> articleList = new ArrayList<Article>();
        BufferedReader bufferedReader = new BufferedReader(
                new InputStreamReader(new FileInputStream(new File(FILE_PATH)), "UTF-8"));
        try {
            String line;
            while ((line = bufferedReader.readLine()) != null) {
                String[] split = StringUtils.split(line, "\t");
                Article article = new Article();
                article.setId(Integer.valueOf(split[0]));
                article.setTitle(split[1]);
                article.setAuthor(split[2]);
                article.setDescribe(split[3]);
                article.setContent(split[3]);
                articleList.add(article);
            }
        } finally {
            bufferedReader.close();
        }
        HBaseUtils hBaseUtils = new HBaseUtils(); // one shared instance, not one per record
        for (Article a : articleList) {
            String rowKey = String.valueOf(a.getId());
            // Insert the record into HBase, one column per field
            hBaseUtils.put(HBaseUtils.TABLE_NAME, rowKey, HBaseUtils.COLUMNFAMILY_1,
                    HBaseUtils.COLUMNFAMILY_1_TITLE, a.getTitle());
            hBaseUtils.put(HBaseUtils.TABLE_NAME, rowKey, HBaseUtils.COLUMNFAMILY_1,
                    HBaseUtils.COLUMNFAMILY_1_AUTHOR, a.getAuthor());
            hBaseUtils.put(HBaseUtils.TABLE_NAME, rowKey, HBaseUtils.COLUMNFAMILY_1,
                    HBaseUtils.COLUMNFAMILY_1_DESCRIBE, a.getDescribe());
            hBaseUtils.put(HBaseUtils.TABLE_NAME, rowKey, HBaseUtils.COLUMNFAMILY_1,
                    HBaseUtils.COLUMNFAMILY_1_CONTENT, a.getContent());
            // Insert the record into Elasticsearch as well
            EsUtil.addIndex(EsUtil.DEFAULT_INDEX, EsUtil.DEFAULT_TYPE, a);
        }
    }
}
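HBaseUtils is referenced above but not shown in the original; a minimal sketch consistent with the calls above, with constants and column names inferred from usage (all assumptions), might look like this:
import java.io.IOException;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseUtils {
    public static final String TABLE_NAME = "article";
    public static final String COLUMNFAMILY_1 = "cf";
    public static final String COLUMNFAMILY_1_TITLE = "title";
    public static final String COLUMNFAMILY_1_AUTHOR = "author";
    public static final String COLUMNFAMILY_1_DESCRIBE = "describe";
    public static final String COLUMNFAMILY_1_CONTENT = "content";

    private final Connection connection;

    public HBaseUtils() throws IOException {
        this.connection = ConnectionFactory.createConnection(HBaseConfiguration.create());
    }

    /** Write a single cell: table / rowKey / family / qualifier / value. */
    public void put(String tableName, String rowKey, String family,
                    String qualifier, String value) throws IOException {
        try (Table table = connection.getTable(TableName.valueOf(tableName))) {
            Put p = new Put(Bytes.toBytes(rowKey));
            p.addColumn(Bytes.toBytes(family), Bytes.toBytes(qualifier), Bytes.toBytes(value));
            table.put(p);
        }
    }
}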
/**
 * Add a document to the Elasticsearch index
 * @param index index name
 * @param type mapping type
 * @param article the record to index
 * @return the ID of the indexed document
 */
public static String addIndex(String index, String type, Article article) {
    HashMap<String, Object> hashMap = new HashMap<String, Object>();
    hashMap.put("id", article.getId());
    hashMap.put("title", article.getTitle());
    hashMap.put("describe", article.getDescribe());
    hashMap.put("author", article.getAuthor());
    // Use the article ID (also the HBase RowKey) as the ES document ID
    IndexResponse response = getInstance()
            .prepareIndex(index, type, String.valueOf(article.getId()))
            .setSource(hashMap).get();
    return response.getId();
}
/**
 * Search Elasticsearch
 * @param skey search keyword
 * @param index index name
 * @param type mapping type
 * @param start offset of the first hit
 * @param row maximum number of hits per page
 * @return the result records plus the total hit count
 */
public static Map<String, Object> search(String skey, String index, String type, Integer start, Integer row) {
    HashMap<String, Object> dataMap = new HashMap<String, Object>();
    ArrayList<Map<String, Object>> dataList = new ArrayList<Map<String, Object>>();
    SearchRequestBuilder builder = EsUtil.getInstance().prepareSearch(index);
    builder.setTypes(type);
    builder.setSearchType(SearchType.DFS_QUERY_THEN_FETCH);
    if (StringUtils.isNotBlank(skey)) {
        // Match against both the title and describe fields
        builder.setQuery(QueryBuilders.multiMatchQuery(skey, "title", "describe"));
    }
    builder.setFrom(start);
    builder.setSize(row);
    // Configure highlighting for the matched terms
    builder.addHighlightedField("title");
    builder.addHighlightedField("describe");
    builder.setHighlighterPreTags("<font color='red'>");
    builder.setHighlighterPostTags("</font>");
    SearchResponse response = builder.get(); // execute the query
    SearchHits hits = response.getHits();
    long totalCount = hits.getTotalHits(); // total number of hits
    for (SearchHit searchHit : hits.getHits()) {
        Map<String, Object> source = searchHit.getSource(); // the stored document
        // Replace title and describe with their highlighted fragments, if any
        Map<String, HighlightField> highlightFields = searchHit.getHighlightFields();
        HighlightField highlightFieldTitle = highlightFields.get("title");
        if (highlightFieldTitle != null) {
            StringBuilder name = new StringBuilder();
            for (Text text : highlightFieldTitle.getFragments()) {
                name.append(text);
            }
            source.put("title", name.toString()); // highlighted title
        }
        HighlightField highlightFieldDescribe = highlightFields.get("describe");
        if (highlightFieldDescribe != null) {
            StringBuilder name = new StringBuilder();
            for (Text text : highlightFieldDescribe.getFragments()) {
                name.append(text);
            }
            source.put("describe", name.toString()); // highlighted describe
        }
        dataList.add(source);
    }
    dataMap.put("count", totalCount);
    dataMap.put("dataList", dataList);
    return dataMap;
}
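The EsUtil.getInstance() used above is also not shown in the original; it presumably hands back a singleton TransportClient. A sketch matching the ES 2.x-era API used in this scheme, with the cluster name, address, and index/type constants all assumed:
import java.net.InetAddress;
import org.elasticsearch.client.Client;
import org.elasticsearch.client.transport.TransportClient;
import org.elasticsearch.common.settings.Settings;
import org.elasticsearch.common.transport.InetSocketTransportAddress;

public class EsUtil {
    public static final String DEFAULT_INDEX = "articles";
    public static final String DEFAULT_TYPE = "article";
    private static Client client;

    /** Lazily build a singleton TransportClient (ES 2.x-style bootstrap). */
    public static synchronized Client getInstance() {
        if (client == null) {
            try {
                Settings settings = Settings.settingsBuilder()
                        .put("cluster.name", "elasticsearch").build();
                client = TransportClient.builder().settings(settings).build()
                        .addTransportAddress(new InetSocketTransportAddress(
                                InetAddress.getByName("localhost"), 9300));
            } catch (Exception e) {
                throw new RuntimeException(e);
            }
        }
        return client;
    }
    // addIndex(...) and search(...) shown above would live in this class as well
}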
Scheme 2 code: https://segmentfault.com/a/1190000018071516
HbaseDataSyncEsObserver.java (skeleton for verifying that the coprocessor hooks fire)
public class HbaseDataSyncEsObserver implements RegionObserver, RegionCoprocessor {
    private static final Logger LOG = Logger.getLogger(HbaseDataSyncEsObserver.class);

    // Required in HBase 2.x so the framework can obtain the RegionObserver
    @Override
    public Optional<RegionObserver> getRegionObserver() {
        return Optional.of(this);
    }

    @Override
    public void start(CoprocessorEnvironment env) throws IOException {
        LOG.info("====Test Start====");
    }

    @Override
    public void stop(CoprocessorEnvironment env) throws IOException {
        LOG.info("====Test End====");
    }

    @Override
    public void postPut(ObserverContext<RegionCoprocessorEnvironment> e, Put put, WALEdit edit, Durability durability) throws IOException {
        LOG.info("====Test postPut====");
    }

    @Override
    public void postDelete(ObserverContext<RegionCoprocessorEnvironment> e, Delete delete, WALEdit edit, Durability durability) throws IOException {
        LOG.info("====Test postDelete====");
    }
}
ElasticSearchBulkOperator.java
public class ElasticSearchBulkOperator {
    private static final Log LOG = LogFactory.getLog(ElasticSearchBulkOperator.class);
    private static final int MAX_BULK_COUNT = 10000;
    private static BulkRequestBuilder bulkRequestBuilder = null;
    private static final Lock commitLock = new ReentrantLock();
    private static ScheduledExecutorService scheduledExecutorService = null;

    static {
        // init the ES bulkRequestBuilder
        bulkRequestBuilder = ESClient.client.prepareBulk();
        bulkRequestBuilder.setRefreshPolicy(WriteRequest.RefreshPolicy.IMMEDIATE);
        // init a thread pool of size 1
        scheduledExecutorService = Executors.newScheduledThreadPool(1);
        // background task that periodically syncs buffered requests to the ES cluster;
        // commitLock keeps bulk access thread-safe
        final Runnable beeper = () -> {
            commitLock.lock();
            try {
                // flush whatever has accumulated, regardless of the size threshold
                bulkRequest(0);
            } catch (Exception ex) {
                LOG.error("scheduled bulk flush error: " + ex.getMessage());
            } finally {
                commitLock.unlock();
            }
        };
        // schedule the flush task: 10-second initial delay, then every 30 seconds
        scheduledExecutorService.scheduleAtFixedRate(beeper, 10, 30, TimeUnit.SECONDS);
    }

    public static void shutdownScheduEx() {
        if (null != scheduledExecutorService && !scheduledExecutorService.isShutdown()) {
            scheduledExecutorService.shutdown();
        }
    }

    private static void bulkRequest(int threshold) {
        if (bulkRequestBuilder.numberOfActions() > threshold) {
            BulkResponse bulkItemResponse = bulkRequestBuilder.execute().actionGet();
            // only reset the builder on success; on failure the batch stays queued for retry
            if (!bulkItemResponse.hasFailures()) {
                bulkRequestBuilder = ESClient.client.prepareBulk();
            }
        }
    }

    /**
     * Add an update request to the bulk buffer,
     * using commitLock to keep bulk access thread-safe.
     * @param builder the update request
     */
    public static void addUpdateBuilderToBulk(UpdateRequestBuilder builder) {
        commitLock.lock();
        try {
            bulkRequestBuilder.add(builder);
            bulkRequest(MAX_BULK_COUNT);
        } catch (Exception ex) {
            LOG.error("update bulk " + "gejx_test" + " index error : " + ex.getMessage());
        } finally {
            commitLock.unlock();
        }
    }

    /**
     * Add a delete request to the bulk buffer,
     * using commitLock to keep bulk access thread-safe.
     * @param builder the delete request
     */
    public static void addDeleteBuilderToBulk(DeleteRequestBuilder builder) {
        commitLock.lock();
        try {
            bulkRequestBuilder.add(builder);
            bulkRequest(MAX_BULK_COUNT);
        } catch (Exception ex) {
            LOG.error("delete bulk " + "gejx_test" + " index error : " + ex.getMessage());
        } finally {
            commitLock.unlock();
        }
    }
}
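Two knobs control flushing in this helper: the size threshold MAX_BULK_COUNT, checked on every add, and the scheduled task, which calls bulkRequest(0) every 30 seconds to push out whatever has accumulated. Note that when a bulk response reports failures, the builder is not reset, so the whole batch (including actions that already succeeded) is re-sent on the next flush; that retry-everything behavior is tolerable here because the requests are idempotent upserts and deletes.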
ESClient.java
public class ESClient {
    public static Client client;

    /**
     * Init the ES client
     */
    public static void initEsClient() throws UnknownHostException {
        System.setProperty("es.set.netty.runtime.available.processors", "false");
        // cluster.name must match the name of the target ES cluster
        Settings esSettings = Settings.builder().put("cluster.name", "elasticsearch").build();
        client = new PreBuiltTransportClient(esSettings)
                .addTransportAddress(new TransportAddress(InetAddress.getByName("localhost"), 9300));
    }

    /**
     * Close the ES client
     */
    public static void closeEsClient() {
        client.close();
    }
}
HbaseDataSyncEsObserver.java (the full version)
public class HbaseDataSyncEsObserver implements RegionObserver, RegionCoprocessor {
    private static final Logger LOG = Logger.getLogger(HbaseDataSyncEsObserver.class);

    @Override
    public Optional<RegionObserver> getRegionObserver() {
        return Optional.of(this);
    }

    @Override
    public void start(CoprocessorEnvironment env) throws IOException {
        // init the ES client when the coprocessor is loaded
        ESClient.initEsClient();
        LOG.info("****init start*****");
    }

    @Override
    public void stop(CoprocessorEnvironment env) throws IOException {
        ESClient.closeEsClient();
        // shut down the scheduled flush task
        ElasticSearchBulkOperator.shutdownScheduEx();
        LOG.info("****end*****");
    }

    @Override
    public void postPut(ObserverContext<RegionCoprocessorEnvironment> e, Put put, WALEdit edit, Durability durability) throws IOException {
        String indexId = Bytes.toString(put.getRow());
        try {
            // flatten every cell of the Put into qualifier -> value pairs
            NavigableMap<byte[], List<Cell>> familyMap = put.getFamilyCellMap();
            Map<String, Object> json = new HashMap<>();
            for (Map.Entry<byte[], List<Cell>> entry : familyMap.entrySet()) {
                for (Cell cell : entry.getValue()) {
                    String key = Bytes.toString(CellUtil.cloneQualifier(cell));
                    String value = Bytes.toString(CellUtil.cloneValue(cell));
                    json.put(key, value);
                }
            }
            // upsert the document so both inserts and updates are covered
            ElasticSearchBulkOperator.addUpdateBuilderToBulk(
                    ESClient.client.prepareUpdate("gejx_test", "dmp_ods", indexId).setDocAsUpsert(true).setDoc(json));
            LOG.info("**** postPut success *****");
        } catch (Exception ex) {
            LOG.error("observer put a doc, index [ " + "gejx_test" + " ] indexId [" + indexId + "] error : " + ex.getMessage());
        }
    }

    @Override
    public void postDelete(ObserverContext<RegionCoprocessorEnvironment> e, Delete delete, WALEdit edit, Durability durability) throws IOException {
        String indexId = Bytes.toString(delete.getRow());
        try {
            ElasticSearchBulkOperator.addDeleteBuilderToBulk(
                    ESClient.client.prepareDelete("gejx_test", "dmp_ods", indexId));
            LOG.info("**** postDelete success *****");
        } catch (Exception ex) {
            LOG.error("observer delete a doc, index [ " + "gejx_test" + " ] indexId [" + indexId + "] error : " + ex.getMessage());
        }
    }
}
The jar must be uploaded to HDFS, and the hbase user must be granted permission on it; in my tests I uploaded it under /apps/hbase (an HDP environment).
# Create the test table
create 'gejx_test', 'cf'
# Disable the table
disable 'gejx_test'
# Attach the coprocessor to the table
alter 'gejx_test', METHOD => 'table_att', 'coprocessor' => 'hdfs://dev-dmp2.fengdai.org:8020/apps/hbase/hbase-observer-simple-example.jar|com.tairanchina.csp.dmp.examples.HbaseDataSyncEsObserver|1073741823'
# Re-enable the table
enable 'gejx_test'
# Inspect the table
desc 'gejx_test'
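If the observer later needs to be replaced or removed (for example after redeploying the jar), the attachment can be undone with table_att_unset; the attribute name below assumes this coprocessor was the first one attached to the table:
# Detach the coprocessor (disable/enable the table around the change)
disable 'gejx_test'
alter 'gejx_test', METHOD => 'table_att_unset', NAME => 'coprocessor$1'
enable 'gejx_test'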
3. Query Speed Test
Trajectory playback stresses performance in two ways: aggregate statistical queries and real-time responsiveness. Figure 8 shows the measured query efficiency for different result-set sizes against a PB-scale base dataset.
For small result sets, responses take over 10 s without the index but under 1 s with it, a speedup of roughly 20x. For large result sets (on the order of tens of thousands of rows), queries run 9 to 10 times faster, a major gain in real-time query efficiency and speed. In practice, a 20-minute trajectory playback is about 600 records and can be returned within 5 s; since the response can be segmented, issuing four queries over 5-minute windows keeps the page rendering fairly smooth.