ES -- Secondary Index

1. Use Cases

Pros and cons of ES

  • Pros: supports full-text indexing; any field can be indexed and queried as needed
  • Cons: with large data volumes it cannot meet strict real-time performance requirements, and ES itself carries a comparatively high risk to data safety

Pros and cons of HBase

  • Pros: high-performance real-time reads and writes over large datasets; data is relatively safe
  • Cons: the rowkey is the only index, while the query conditions of real business are bound to vary
    • if the query condition is not a prefix of the rowkey
    • the index cannot be used, so a secondary index has to be built
  • Why not build secondary index tables inside HBase?
    • every single query requirement would need its own HBase secondary index table
      • main table: rowkey: id
      • index table: rowkey: name
        • id
      • index table: rowkey: age
        • id

Building the index table in ES

  • Store all of the query conditions in ES and index them; each ES document points to the corresponding rowkey in HBase
  • Data in ES

      documentId | age | name     | sex  | rowkey:id
      0          | 18  | zhangsan | male | 001

  • Data in HBase

      rowkey:id | age | name     | sex
      001       | 18  | zhangsan | male
      002       | 19  | lisi     | female

  • This avoids building one secondary index table per condition in HBase (a sketch of the resulting query pattern follows below)
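
As a sketch of that query pattern (assuming the articles index defined in section 3 below), step 1 only needs ES to return the id that serves as the HBase rowkey:

GET /articles/article/_search
{
    "_source": ["id"],
    "query": {
        "match": { "title": "意大利" }
    }
}

Step 2 is then a plain get on that id in HBase, as implemented in the code below.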

2. Requirements Analysis

  • Requirement: query by keywords in the title, by source, time, read count, and similar conditions, and retrieve the body of the matching articles
  • Data fields

    ID
    Title
    Source
    Time
    Read count
    Body

  • Could all of it be stored in ES?
    • It could
    • But there are problems
      • for performance and safety reasons there is no need to keep all of the data in ES
      • storing and indexing the full documents would degrade performance

Implementation

  • Store the complete records in HBase
    • id as the rowkey
    • title, source, time, read count, body
  • Store only the fields used as query conditions in ES, and index all of them
    • ID
    • Title
    • Source
    • Time
    • Read count
  • User query
    • step 1: query ES with the search conditions to get the IDs of the matching records
    • step 2: use those IDs to fetch the article bodies from HBase

Workflow

  • step 1: read the Excel file, parse every record, and wrap it in a JavaBean
    each record becomes one JavaBean object
    all JavaBeans are collected into a list
  • step 2: write the JavaBeans into HBase and ES
    ES: id, title, source, time, read count
    HBase: all fields are stored in HBase
    rowkey: id
  • step 3: query the body of articles by keywords in the title
    • first query ES to get the IDs matching the title
    • then query HBase by ID and return the body

3. Code Implementation

  • Create the corresponding index in ES
PUT /articles
{  
    "settings":{  
         "number_of_shards":3,  
         "number_of_replicas":1,
         "analysis" : {
            "analyzer" : {
                "ik" : {
                    "tokenizer" : "ik_max_word"
                }
            }
        }
    }, 
    "mappings":{  
         "article":{  
             "dynamic":"strict",
             "_source": {
               "includes": [
                  "id","title","from","readCount","time"
                ],
               "excludes": [
                  "content"
               ]
             },
             "properties":{  
                 "id":{"type": "keyword", "store": true},  
                 "title":{"type": "text","store": true,"index" : true,"analyzer": "ik_max_word"}, 
                 "from":{"type": "keyword","store": true}, 
                 "readCount":{"type": "integer","store": true},  
                 "content":{"type": "text","store": false,"index": false},
                 "time": {"type": "keyword", "index": false}
             }  
         }  
    }  
}
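
To sanity-check the result, the mapping can be read back with the standard mapping API:

GET /articles/_mapping

Two details of this mapping are worth noting: the custom ik analyzer declared under analysis is never referenced, because the fields use the IK plugin's built-in ik_max_word analyzer directly; and time is declared with "index": false, so it cannot be used as a search condition as defined here. If filtering by time is required, it would need to be indexed (for example as a date field).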
  • Create the corresponding table in HBase
    • operate as the root user
    • start HDFS
      • start-dfs.sh
    • start Zookeeper
      • /export/servers/zookeeper-3.4.5-cdh5.14.0/bin/start-zk-all.sh
    • start HBase
      • start-hbase.sh
    • create the table
      • create 'articles','article'
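
To confirm the table and its column family exist, standard HBase shell commands can be used:

describe 'articles'
# after data has been loaded:
scan 'articles', {LIMIT => 1}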
  • Build the Excel file parsing class
package cn.hanjiaxiaozhi.util;

import cn.hanjiaxiaozhi.bean.EsArticle;
import org.apache.poi.xssf.usermodel.XSSFRow;
import org.apache.poi.xssf.usermodel.XSSFSheet;
import org.apache.poi.xssf.usermodel.XSSFWorkbook;

import java.io.FileInputStream;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

/**
 * @ClassName ExcelUtil
 * @Description TODO Utility class that parses the data in an Excel file into JavaBeans
 * @Date 2020/7/5 15:46
 * @Create By     Frank
 */
public class ExcelUtil {

    /**
     * Parses the data of an Excel file
     * @param path
     * @return
     */
    public static List<EsArticle> parseExcelData(String path) throws IOException {
        //build the return value
        List<EsArticle> lists = new ArrayList<>();
        //open the file as an input stream
        FileInputStream inputStream = new FileInputStream(path);
        //parse the Excel file into a workbook
        XSSFWorkbook workbook = new XSSFWorkbook(inputStream);
        //get the first sheet
        XSSFSheet sheet = workbook.getSheetAt(0);
        //get the index of the last row, then iterate over the rows
        int lastRowNum = sheet.getLastRowNum();
        //row 0 holds the column names, so start from row 1
        for (int i = 1; i <= lastRowNum; i++) {
            //get the current row
            XSSFRow row = sheet.getRow(i);
            //read every cell of the row
            String id = row.getCell(0).toString();        //id
            String title = row.getCell(1).toString();     //title
            String from = row.getCell(2).toString();      //source
            String time = row.getCell(3).toString();      //time
            String readCount = row.getCell(4).toString(); //read count
            String content = row.getCell(5).toString();   //body
            //wrap the row into a JavaBean
            EsArticle esArticle = new EsArticle(id, title, from, time, readCount, content);
            //add it to the list
            lists.add(esArticle);
        }
        //release the underlying stream
        inputStream.close();
        //return
        return lists;
    }
}
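
The EsArticle JavaBean referenced throughout the code is not shown in the original post; the following is a minimal sketch reconstructed from the constructor and getters used above (all fields kept as String, matching how the Excel cells are read):

package cn.hanjiaxiaozhi.bean;

/**
 * Minimal JavaBean for one article. Hypothetical reconstruction,
 * not from the original post; field names match the ES mapping.
 */
public class EsArticle {
    private String id;
    private String title;
    private String from;      //source
    private String time;
    private String readCount;
    private String content;   //body, stored only in HBase

    //fastjson needs a no-arg constructor for deserialization
    public EsArticle() {
    }

    public EsArticle(String id, String title, String from, String time, String readCount, String content) {
        this.id = id;
        this.title = title;
        this.from = from;
        this.time = time;
        this.readCount = readCount;
        this.content = content;
    }

    public String getId() { return id; }
    public void setId(String id) { this.id = id; }
    public String getTitle() { return title; }
    public void setTitle(String title) { this.title = title; }
    public String getFrom() { return from; }
    public void setFrom(String from) { this.from = from; }
    public String getTime() { return time; }
    public void setTime(String time) { this.time = time; }
    public String getReadCount() { return readCount; }
    public void setReadCount(String readCount) { this.readCount = readCount; }
    public String getContent() { return content; }
    public void setContent(String content) { this.content = content; }

    @Override
    public String toString() {
        return "EsArticle{id='" + id + "', title='" + title + "'}";
    }
}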

  • Build the HBase read/write utility class
package cn.hanjiaxiaozhi.util;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;

import java.io.IOException;

/**
 * @ClassName HbaseUtil
 * @Description TODO Reads from and writes to HBase
 * @Date 2020/7/5 16:13
 * @Create By     Frank
 */
public class HbaseUtil {

    //share one connection instead of opening a new one on every call
    private static Connection conn;

    private static synchronized Connection getConnection() throws IOException {
        if (conn == null) {
            //get an HBase connection
            Configuration conf = HBaseConfiguration.create();
            //set the zookeeper quorum address
            conf.set("hbase.zookeeper.quorum", "node-01:2181,node-02:2181,node-03:2181");
            conn = ConnectionFactory.createConnection(conf);
        }
        return conn;
    }

    private static Table getHbaseTable(String tableName) throws IOException {
        //build the table object
        return getConnection().getTable(TableName.valueOf(tableName));
    }

    /**
     * Writes one cell into HBase
     * @param tableName
     * @param rowkey
     * @param family
     * @param column
     * @param value
     */
    public static void writeToHbase(String tableName, String rowkey, String family, String column, String value) throws IOException {
        //build the HBase table object
        Table table = getHbaseTable(tableName);
        //build the Put object
        Put put = new Put(Bytes.toBytes(rowkey));
        //set family, column, and value
        put.addColumn(Bytes.toBytes(family), Bytes.toBytes(column), Bytes.toBytes(value));
        //execute
        table.put(put);
        //Table objects are lightweight but not thread-safe, so close after use
        table.close();
    }

    /**
     * Returns the body content for a rowkey
     * @param tableName
     * @param rowkey
     * @param family
     * @param column
     * @return
     * @throws IOException
     */
    public static String readFromHbase(String tableName, String rowkey, String family, String column) throws IOException {
        //get the table object
        Table table = getHbaseTable(tableName);
        //build a Get for the rowkey
        Get get = new Get(Bytes.toBytes(rowkey));
        //fetch all data of the rowkey
        Result result = table.get(get);
        table.close();
        //return the value of the requested column
        byte[] content = result.getValue(Bytes.toBytes(family), Bytes.toBytes(column));
        return Bytes.toString(content);
    }
}
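
A quick standalone usage example (hypothetical demo values, assuming the articles table created earlier):

//write one cell, then read it back
HbaseUtil.writeToHbase("articles", "001", "article", "content", "full article body ...");
System.out.println(HbaseUtil.readFromHbase("articles", "001", "article", "content"));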

  • Build the ES read/write utility class
package cn.hanjiaxiaozhi.util;

import cn.hanjiaxiaozhi.bean.EsArticle;
import com.alibaba.fastjson.JSON;
import org.elasticsearch.action.bulk.BulkRequestBuilder;
import org.elasticsearch.action.index.IndexRequestBuilder;
import org.elasticsearch.action.search.SearchResponse;
import org.elasticsearch.client.transport.TransportClient;
import org.elasticsearch.common.settings.Settings;
import org.elasticsearch.common.transport.TransportAddress;
import org.elasticsearch.common.xcontent.XContentType;
import org.elasticsearch.index.query.QueryBuilders;
import org.elasticsearch.search.SearchHit;
import org.elasticsearch.transport.client.PreBuiltTransportClient;

import java.net.InetAddress;
import java.net.UnknownHostException;
import java.util.ArrayList;
import java.util.List;

/**
 * @ClassName EsUtil
 * @Description TODO Reads from and writes to ES
 * @Date 2020/7/5 16:13
 * @Create By     Frank
 */
public class EsUtil {

    static String indexName = "articles";
    static String typeName = "article";

    public static TransportClient getESClient() throws UnknownHostException {
        //cluster.name must match the ES cluster configuration
        Settings settings = Settings.builder().put("cluster.name", "myes").build();
        TransportClient client = new PreBuiltTransportClient(settings)
                .addTransportAddress(new TransportAddress(InetAddress.getByName("node-01"), 9300))
                .addTransportAddress(new TransportAddress(InetAddress.getByName("node-02"), 9300))
                .addTransportAddress(new TransportAddress(InetAddress.getByName("node-03"), 9300));
        return client;
    }

    /**
     * Writes the data into ES
     * @param esArticles
     * @throws UnknownHostException
     */
    public static void writeToES(List<EsArticle> esArticles) throws UnknownHostException {
        //get an ES client
        TransportClient esClient = getESClient();
        //write every record of the list into ES in one bulk request
        BulkRequestBuilder bulk = esClient.prepareBulk();
        //iterate
        for (EsArticle esArticle : esArticles) {
            //turn each JavaBean into a JSON string
            String jsonString = JSON.toJSONString(esArticle);
            //build the index request
            IndexRequestBuilder requestBuilder = esClient.prepareIndex(indexName, typeName, esArticle.getId())
                    .setSource(jsonString, XContentType.JSON);
            //add it to the bulk request
            bulk.add(requestBuilder);
        }
        //execute the bulk request
        bulk.get();
    }

    /**
     * Matches the title against the search keyword and returns the hits as JavaBeans
     * @param keyword
     * @return
     */
    public static List<EsArticle> readFromEs(String keyword) throws UnknownHostException {
        //build the return value
        List<EsArticle> lists = new ArrayList<>();
        //get the client
        TransportClient esClient = getESClient();
        //build and run the search
        SearchResponse response = esClient.prepareSearch(indexName)
                .setTypes(typeName)
                .setQuery(QueryBuilders.termQuery("title", keyword))
                .get();
        //get the matching documents
        SearchHit[] hits = response.getHits().getHits();
        for (SearchHit hit : hits) {
            //get the _source JSON string
            String sourceAsString = hit.getSourceAsString();
            //convert it into a JavaBean
            EsArticle esArticle = JSON.parseObject(sourceAsString, EsArticle.class);
            //add it to the list
            lists.add(esArticle);
        }
        //return
        return lists;
    }
}
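
One caveat in readFromEs: termQuery does not analyze the keyword, so it only matches when the keyword is exactly equal to a single token that ik_max_word produced at index time ("意大利" happens to be such a token). For free-form, possibly multi-word search input, QueryBuilders.matchQuery("title", keyword), which analyzes the query string with the field's analyzer, is usually the better fit.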


Build the main program

package cn.hanjiaxiaozhi.app;

import cn.hanjiaxiaozhi.bean.EsArticle;
import cn.hanjiaxiaozhi.util.EsUtil;
import cn.hanjiaxiaozhi.util.ExcelUtil;
import cn.hanjiaxiaozhi.util.HbaseUtil;

import java.io.IOException;
import java.util.List;

/**
 * @ClassName TestEsAndHbase
 * @Description TODO Writes the Excel data into HBase and ES, then queries through the ES secondary index
 *      get the id from ES via the index
 *      query the body from HBase by id
 * @Date 2020/7/5 15:40
 * @Create By     Frank
 */
public class TestEsAndHbase {

    static String tableName = "articles";
    static String family = "article";

    public static void main(String[] args) throws IOException {
        //todo:1-read the Excel data and wrap every record into a JavaBean
        //path of the Excel file
        String path = "datas/excel/hbaseEs.xlsx";
        //turn every row of the file into a JavaBean
        List<EsArticle> esArticles = ExcelUtil.parseExcelData(path);
//        System.out.println(esArticles);
        //todo:2-write the data into HBase and ES (run once, then comment out)
//        writeData(esArticles);
        //todo:3-query by title
        search("意大利");
    }

    /**
     * Returns the body content for articles whose title matches the keyword
     * @param keyword
     */
    private static void search(String keyword) throws IOException {
        //match the title in ES via the index and return the documents
        List<EsArticle> esArticles = EsUtil.readFromEs(keyword);
        //query the body from HBase by the returned IDs
        for (EsArticle esArticle : esArticles) {
            //get the id of each matching document
            String id = esArticle.getId();
            //query HBase by id
            String content = HbaseUtil.readFromHbase(tableName, id, family, "content");
            //print the result
            System.out.println(content);
        }
    }

    private static void writeData(List<EsArticle> esArticles) throws IOException {
        //write into ES
        EsUtil.writeToES(esArticles);
        //write into HBase
        writeDataToHbase(esArticles);
    }

    private static void writeDataToHbase(List<EsArticle> esArticles) throws IOException {
        //write every record into HBase, column by column
        for (EsArticle esArticle : esArticles) {
            HbaseUtil.writeToHbase(tableName, esArticle.getId(), family, "title", esArticle.getTitle());
            HbaseUtil.writeToHbase(tableName, esArticle.getId(), family, "from", esArticle.getFrom());
            HbaseUtil.writeToHbase(tableName, esArticle.getId(), family, "time", esArticle.getTime());
            HbaseUtil.writeToHbase(tableName, esArticle.getId(), family, "readCount", esArticle.getReadCount());
            HbaseUtil.writeToHbase(tableName, esArticle.getId(), family, "content", esArticle.getContent());
        }
    }
}

Maven dependencies for the secondary index

 <!-- repository locations, in order: aliyun, cloudera, and jboss -->
    <repositories>
        <repository>
            <id>aliyun</id>
            <url>http://maven.aliyun.com/nexus/content/groups/public/</url>
        </repository>
        <repository>
            <id>cloudera</id>
            <url>https://repository.cloudera.com/artifactory/cloudera-repos/</url>
        </repository>
        <repository>
            <id>jboss</id>
            <url>http://repository.jboss.com/nexus/content/groups/public</url>
        </repository>
    </repositories>
    <dependencies>
        <!--ES client-->
        <dependency>
            <groupId>org.elasticsearch.client</groupId>
            <artifactId>transport</artifactId>
            <version>6.0.0</version>
        </dependency>
        <!--logger-->
        <dependency>
            <groupId>org.apache.logging.log4j</groupId>
            <artifactId>log4j-core</artifactId>
            <version>2.9.1</version>
        </dependency>
        <!--JSON parsing utility-->
        <dependency>
            <groupId>com.alibaba</groupId>
            <artifactId>fastjson</artifactId>
            <version>1.2.47</version>
        </dependency>
        <!--unit testing-->
        <dependency>
            <groupId>junit</groupId>
            <artifactId>junit</artifactId>
            <version>4.12</version>
        </dependency>
        <!--Excel parsing utilities-->
        <dependency>
            <groupId>org.apache.poi</groupId>
            <artifactId>poi-ooxml-schemas</artifactId>
            <version>3.8</version>
        </dependency>
        <dependency>
            <groupId>org.apache.poi</groupId>
            <artifactId>poi-ooxml</artifactId>
            <version>3.8</version>
        </dependency>
        <dependency>
            <groupId>org.apache.poi</groupId>
            <artifactId>poi</artifactId>
            <version>3.8</version>
        </dependency>
        <!--HBase dependency-->
        <dependency>
            <groupId>org.apache.hbase</groupId>
            <artifactId>hbase-client</artifactId>
            <version>1.2.0-cdh5.14.0</version>
        </dependency>
    </dependencies>
    <build>
        <plugins>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-compiler-plugin</artifactId>
                <configuration>
                    <source>1.8</source>
                    <target>1.8</target>
                </configuration>
            </plugin>
        </plugins>
    </build>