1. Application Scenarios
Pros and cons of ES
- Pros: supports full-text indexing; any field can be indexed and queried as requirements demand
- Cons: with large data volumes, performance cannot meet strict real-time requirements, and the risk to data safety is relatively high
Pros and cons of HBase
- Pros: high-performance real-time reads and writes over very large datasets; data is relatively safe
- Cons: the rowkey is the only index, but in real business the query conditions vary widely;
  if the query condition is not a prefix of the rowkey, the index cannot be used, and the only option is to build secondary indexes
- Why not build secondary index tables inside HBase?
  - every single query requirement needs its own secondary index table:
    - original table: rowkey = id
    - index table: rowkey = name -> id
    - index table: rowkey = age -> id
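The rowkey-prefix limitation can be illustrated with a sorted map standing in for HBase's rowkey-ordered storage. This is a minimal sketch; the class name and all data are hypothetical:

```java
import java.util.SortedMap;
import java.util.TreeMap;

public class RowkeyPrefixDemo {
    public static void main(String[] args) {
        // A TreeMap stands in for HBase's rowkey-sorted storage (hypothetical rows).
        TreeMap<String, String> table = new TreeMap<>();
        table.put("001_zhangsan", "row1");
        table.put("002_lisi", "row2");
        table.put("003_wangwu", "row3");

        // Prefix query on the rowkey: an efficient range scan,
        // analogous to an HBase scan with startRow/stopRow.
        SortedMap<String, String> hit = table.subMap("001", "001\uffff");
        System.out.println(hit.size());

        // Query by a non-prefix field (here the name suffix) cannot use the
        // sorted order: every key must be examined, i.e. a full table scan.
        long matches = table.keySet().stream()
                .filter(k -> k.endsWith("lisi"))
                .count();
        System.out.println(matches);
    }
}
```

The prefix lookup touches only the keys inside the range, while the non-prefix lookup scans all of them; this is why a second, differently-keyed index is needed.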
Using ES as the index table
- Store all the query conditions in ES and index them, with each document pointing to the rowkey in HBase
- Data in ES

documentId | age | name | sex | rowkey:id
---|---|---|---|---
0 | 18 | zhangsan | male | 001

- Data in HBase

rowkey:id | age | name | sex
---|---|---|---
001 | 18 | zhangsan | male
002 | 19 | lisi | female

- This avoids building multiple secondary index tables in HBase
2. Requirements Analysis
- Requirement: retrieve the body of an article by querying on keywords in the title, the source, the time, the read count, and similar conditions
- Data fields
  - ID
  - title
  - source
  - time
  - read count
  - body
- Could everything be stored in ES?
  - it is possible
  - but there are problems
    - for performance and safety reasons there is no need to store all the data in ES
    - if a large volume of data is stored and fully indexed, performance suffers
Implementation
- Store the complete data in HBase
  - id as the rowkey
  - title, source, time, read count, body
- Store only the fields used as query conditions in ES, and index all of them
  - ID
  - title
  - source
  - time
  - read count
- Query path
  - step1: query ES with the search conditions to get the IDs of the matching documents
  - step2: use each ID to fetch the article body from HBase
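The two-step lookup above can be sketched with in-memory maps standing in for ES and HBase. This is only a conceptual sketch; the class name, keyword, and all data are hypothetical:

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class SecondaryIndexDemo {
    public static void main(String[] args) {
        // ES stand-in: an index over the title keyword -> matching article ids.
        Map<String, List<String>> esTitleIndex = new HashMap<>();
        esTitleIndex.put("italy", Arrays.asList("001", "003"));

        // HBase stand-in: rowkey (id) -> full article body.
        Map<String, String> hbase = new HashMap<>();
        hbase.put("001", "content of article 001");
        hbase.put("002", "content of article 002");
        hbase.put("003", "content of article 003");

        // step1: query the "ES" index by keyword to get the matching ids
        List<String> ids = esTitleIndex.getOrDefault("italy", Collections.emptyList());
        // step2: fetch each article body from "HBase" by id
        for (String id : ids) {
            System.out.println(hbase.get(id));
        }
    }
}
```

Only the ids travel between the two stores, so ES never has to hold the large article bodies.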
Workflow
- step1: read the Excel file, parse each record, and wrap it in a JavaBean
  - each record becomes one JavaBean object
  - all JavaBeans are collected into a list
- step2: write the JavaBeans to HBase and ES
  - ES: id, title, source, time, read count
  - HBase: all the fields, with the id as the rowkey
- step3: query the body of the matching articles by title
  - first query ES to get the IDs matching the title
  - then query HBase by ID and return the body
3. Code Implementation
- Create the corresponding index in ES
PUT /articles
{
"settings":{
"number_of_shards":3,
"number_of_replicas":1,
"analysis" : {
"analyzer" : {
"ik" : {
"tokenizer" : "ik_max_word"
}
}
}
},
"mappings":{
"article":{
"dynamic":"strict",
"_source": {
"includes": [
"id","title","from","readCount","time"
],
"excludes": [
"content"
]
},
"properties":{
"id":{"type": "keyword", "store": true},
"title":{"type": "text","store": true,"index" : true,"analyzer": "ik_max_word"},
"from":{"type": "keyword","store": true},
"readCount":{"type": "integer","store": true},
"content":{"type": "text","store": false,"index": false},
"time": {"type": "keyword", "index": false}
}
}
}
}
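Once the index exists, the mapping can be sanity-checked with a search against the analyzed title field. A minimal DSL sketch (the keyword is illustrative); it mirrors the term query the Java client issues later:

```
GET /articles/article/_search
{
  "query": {
    "term": { "title": "意大利" }
  }
}
```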
- Create the corresponding table in HBase
  - operate as the root user
  - start HDFS
    - start-dfs.sh
  - start Zookeeper
    - /export/servers/zookeeper-3.4.5-cdh5.14.0/bin/start-zk-all.sh
  - start HBase
    - start-hbase.sh
  - create the table
    - create 'articles','article'
- Build the Excel parsing utility class
package cn.hanjiaxiaozhi.util;
import cn.hanjiaxiaozhi.bean.EsArticle;
import org.apache.poi.xssf.usermodel.XSSFRow;
import org.apache.poi.xssf.usermodel.XSSFSheet;
import org.apache.poi.xssf.usermodel.XSSFWorkbook;
import java.io.FileInputStream;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
/**
 * @ClassName ExcelUtil
 * @Description Utility class that parses the data in an Excel file into JavaBeans
 * @Date 2020/7/5 15:46
 * @Create By Frank
 */
public class ExcelUtil {
    /**
     * Parses the data in the Excel file
     * @param path
     * @return
     */
    public static List<EsArticle> parseExcelData(String path) throws IOException {
        // build the return value
        List<EsArticle> lists = new ArrayList<>();
        // open the file as an input stream
        FileInputStream inputStream = new FileInputStream(path);
        // parse the Excel file into a workbook
        XSSFWorkbook sheets = new XSSFWorkbook(inputStream);
        // get the first sheet
        XSSFSheet sheet = sheets.getSheetAt(0);
        // get the total number of rows, then iterate over each row
        int lastRowNum = sheet.getLastRowNum();
        // row 0 holds the column names, so start from row 1
        for (int i = 1; i <= lastRowNum; i++) {
            // get the current row
            XSSFRow row = sheet.getRow(i);
            // read every cell of the row
            String id = row.getCell(0).toString();        // id
            String title = row.getCell(1).toString();     // title
            String from = row.getCell(2).toString();      // source
            String time = row.getCell(3).toString();      // time
            String readCount = row.getCell(4).toString(); // read count
            String content = row.getCell(5).toString();   // body
            // wrap the row in a JavaBean object
            EsArticle esArticle = new EsArticle(id, title, from, time, readCount, content);
            // add it to the list
            lists.add(esArticle);
        }
        // close the workbook and return the list
        sheets.close();
        return lists;
    }
}
- Build the HBase read/write utility class
package cn.hanjiaxiaozhi.util;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;
import java.io.IOException;
/**
 * @ClassName HbaseUtil
 * @Description Utility class for reading from and writing to HBase
 * @Date 2020/7/5 16:13
 * @Create By Frank
 */
public class HbaseUtil {
    private static Table getHbaseTable(String tableName) throws IOException {
        // get an HBase connection
        Configuration conf = HBaseConfiguration.create();
        // point the client at the Zookeeper quorum
        conf.set("hbase.zookeeper.quorum","node-01:2181,node-02:2181,node-03:2181");
        Connection conn = ConnectionFactory.createConnection(conf);
        // build the table object
        // note: a new connection is opened on every call; a production version would share and close it
        Table table = conn.getTable(TableName.valueOf(tableName));
        return table;
    }
    /**
     * Writes one cell to HBase
     * @param tableName
     * @param rowkey
     * @param family
     * @param column
     * @param value
     */
    public static void writeToHbase(String tableName,String rowkey,String family,String column,String value) throws IOException {
        // build the HBase table object
        Table table = getHbaseTable(tableName);
        // build the Put object
        Put put = new Put(Bytes.toBytes(rowkey));
        // set the cell to write
        put.addColumn(Bytes.toBytes(family),Bytes.toBytes(column),Bytes.toBytes(value));
        // execute the write
        table.put(put);
    }
    /**
     * Returns the body content for the given rowkey
     * @param tableName
     * @param rowkey
     * @param family
     * @param column
     * @return
     * @throws IOException
     */
    public static String readFromHbase(String tableName,String rowkey,String family,String column) throws IOException {
        // get the table object
        Table table = getHbaseTable(tableName);
        // build a Get for this rowkey
        Get get = new Get(Bytes.toBytes(rowkey));
        // fetch all the data for this rowkey
        Result result = table.get(get);
        // return the value of the requested column
        byte[] content = result.getValue(Bytes.toBytes(family), Bytes.toBytes(column));
        return Bytes.toString(content);
    }
}
- Build the ES read/write utility class
package cn.hanjiaxiaozhi.util;
import cn.hanjiaxiaozhi.bean.EsArticle;
import com.alibaba.fastjson.JSON;
import org.elasticsearch.action.bulk.BulkRequestBuilder;
import org.elasticsearch.action.index.IndexRequestBuilder;
import org.elasticsearch.action.search.SearchResponse;
import org.elasticsearch.client.transport.TransportClient;
import org.elasticsearch.common.settings.Settings;
import org.elasticsearch.common.transport.TransportAddress;
import org.elasticsearch.common.xcontent.XContentType;
import org.elasticsearch.index.query.QueryBuilders;
import org.elasticsearch.search.SearchHit;
import org.elasticsearch.transport.client.PreBuiltTransportClient;
import java.net.InetAddress;
import java.net.UnknownHostException;
import java.util.ArrayList;
import java.util.List;
/**
 * @ClassName EsUtil
 * @Description Utility class for reading from and writing to ES
 * @Date 2020/7/5 16:13
 * @Create By Frank
 */
public class EsUtil {
    static String indexName = "articles";
    static String typeName = "article";
    public static TransportClient getESClient() throws UnknownHostException {
        Settings settings = Settings.builder().put("cluster.name","myes").build();
        TransportClient client = new PreBuiltTransportClient(settings)
                .addTransportAddress(new TransportAddress(InetAddress.getByName("node-01"),9300))
                .addTransportAddress(new TransportAddress(InetAddress.getByName("node-02"),9300))
                .addTransportAddress(new TransportAddress(InetAddress.getByName("node-03"),9300));
        return client;
    }
    /**
     * Writes the data into ES
     * @param esArticles
     * @throws UnknownHostException
     */
    public static void writeToES(List<EsArticle> esArticles) throws UnknownHostException {
        // get an ES client
        TransportClient esClient = getESClient();
        // write every record in the list to ES with one bulk request
        BulkRequestBuilder bulk = esClient.prepareBulk();
        // iterate over the records
        for (EsArticle esArticle : esArticles) {
            // turn each JavaBean into a JSON string
            String jsonString = JSON.toJSONString(esArticle);
            // build the index request
            IndexRequestBuilder requestBuilder = esClient.prepareIndex(indexName, typeName, esArticle.getId()).setSource(jsonString, XContentType.JSON);
            // add it to the bulk request
            bulk.add(requestBuilder);
        }
        // execute the bulk request
        bulk.get();
    }
    /**
     * Matches the title against the search keyword and returns the matching records as JavaBeans
     * @param keyword
     * @return
     */
    public static List<EsArticle> readFromEs(String keyword) throws UnknownHostException {
        // build the return value
        List<EsArticle> lists = new ArrayList<>();
        // get the client
        TransportClient esClient = getESClient();
        // build and run the search
        SearchResponse title = esClient.prepareSearch(indexName)
                .setTypes(typeName)
                .setQuery(QueryBuilders.termQuery("title", keyword))
                .get();
        // get the matching hits
        SearchHit[] hits = title.getHits().getHits();
        for (SearchHit hit : hits) {
            // get the JSON string of each hit
            String sourceAsString = hit.getSourceAsString();
            // convert it into a JavaBean
            EsArticle esArticle = JSON.parseObject(sourceAsString, EsArticle.class);
            // add it to the list
            lists.add(esArticle);
        }
        // return the result
        return lists;
    }
}
Build the main program
package cn.hanjiaxiaozhi.app;
import cn.hanjiaxiaozhi.bean.EsArticle;
import cn.hanjiaxiaozhi.util.EsUtil;
import cn.hanjiaxiaozhi.util.ExcelUtil;
import cn.hanjiaxiaozhi.util.HbaseUtil;
import java.io.IOException;
import java.util.List;
/**
 * @ClassName TestEsAndHbase
 * @Description Writes the Excel data into HBase and ES, then queries through the secondary index built in ES:
 * get the ids from the ES index,
 * then fetch the article body from HBase by id
 * @Date 2020/7/5 15:40
 * @Create By Frank
 */
public class TestEsAndHbase {
    static String tableName = "articles";
    static String family = "article";
    public static void main(String[] args) throws IOException {
        // todo:1 - read the Excel data and wrap each record in a JavaBean
        // path to the data file
        String path = "datas/excel/hbaseEs.xlsx";
        // turn every row of the file into a JavaBean
        List<EsArticle> esArticles = ExcelUtil.parseExcelData(path);
        // System.out.println(esArticles);
        // todo:2 - write the data into HBase and ES
        // writeData(esArticles);
        // todo:3 - query by title
        search("意大利");
    }
    /**
     * Returns the article bodies whose titles match the keyword
     * @param keyword
     */
    private static void search(String keyword) throws IOException {
        // match the keyword against the title index in ES and get the matching documents
        List<EsArticle> esArticles = EsUtil.readFromEs(keyword);
        // fetch the body of each match from HBase by id
        for (EsArticle esArticle : esArticles) {
            // get the id of the matching record
            String id = esArticle.getId();
            // look up the body in HBase by id
            String content = HbaseUtil.readFromHbase(tableName, id, family, "content");
            // print the result
            System.out.println(content);
        }
    }
    private static void writeData(List<EsArticle> esArticles) throws IOException {
        // write to ES
        EsUtil.writeToES(esArticles);
        // write to HBase
        writeDataToHbase(esArticles);
    }
    private static void writeDataToHbase(List<EsArticle> esArticles) throws IOException {
        // write each record to HBase, one cell at a time
        // note: combining the five cells into a single Put would be more efficient
        for (EsArticle esArticle : esArticles) {
            HbaseUtil.writeToHbase(tableName,esArticle.getId(),family,"title",esArticle.getTitle());
            HbaseUtil.writeToHbase(tableName,esArticle.getId(),family,"from",esArticle.getFrom());
            HbaseUtil.writeToHbase(tableName,esArticle.getId(),family,"time",esArticle.getTime());
            HbaseUtil.writeToHbase(tableName,esArticle.getId(),family,"readCount",esArticle.getReadCount());
            HbaseUtil.writeToHbase(tableName,esArticle.getId(),family,"content",esArticle.getContent());
        }
    }
}
Maven dependencies for the secondary index project
<!-- Repository locations: aliyun, cloudera, and jboss -->
<repositories>
<repository>
<id>aliyun</id>
<url>http://maven.aliyun.com/nexus/content/groups/public/</url>
</repository>
<repository>
<id>cloudera</id>
<url>https://repository.cloudera.com/artifactory/cloudera-repos/</url>
</repository>
<repository>
<id>jboss</id>
<url>http://repository.jboss.com/nexus/content/groups/public</url>
</repository>
</repositories>
<dependencies>
<!-- ES client -->
<dependency>
<groupId>org.elasticsearch.client</groupId>
<artifactId>transport</artifactId>
<version>6.0.0</version>
</dependency>
<!-- logging -->
<dependency>
<groupId>org.apache.logging.log4j</groupId>
<artifactId>log4j-core</artifactId>
<version>2.9.1</version>
</dependency>
<!-- JSON parsing -->
<dependency>
<groupId>com.alibaba</groupId>
<artifactId>fastjson</artifactId>
<version>1.2.47</version>
</dependency>
<!-- unit testing -->
<dependency>
<groupId>junit</groupId>
<artifactId>junit</artifactId>
<version>4.12</version>
</dependency>
<!-- Excel parsing -->
<dependency>
<groupId>org.apache.poi</groupId>
<artifactId>poi-ooxml-schemas</artifactId>
<version>3.8</version>
</dependency>
<dependency>
<groupId>org.apache.poi</groupId>
<artifactId>poi-ooxml</artifactId>
<version>3.8</version>
</dependency>
<dependency>
<groupId>org.apache.poi</groupId>
<artifactId>poi</artifactId>
<version>3.8</version>
</dependency>
<!-- HBase client -->
<dependency>
<groupId>org.apache.hbase</groupId>
<artifactId>hbase-client</artifactId>
<version>1.2.0-cdh5.14.0</version>
</dependency>
</dependencies>
<build>
<plugins>
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-compiler-plugin</artifactId>
<configuration>
<source>1.8</source>
<target>1.8</target>
</configuration>
</plugin>
</plugins>
</build>