1. Inverted Index
When querying a database by primary-key id, the lookup is fast because the primary key itself is indexed. But if a field has no index, MySQL falls back to scanning rows one by one, which is very slow. This row-first lookup is called a forward index.
es instead queries through a structure called an inverted index.
1.1 Analysis (Tokenization)
When defining an es index, if a field's type is text and an analyzer is specified, the field value is tokenized and the resulting terms are written into the inverted index.
MySQL table:

id | title | price |
---|---|---|
1 | 小米手机 | 3499 |
2 | 华为手机 | 4999 |
3 | 华为小米充电器 | 49 |
4 | 小米手环 | 299 |

Corresponding es document (for row 1):
{
id : 1,
title : "小米手机",
price : 3499
}
Assuming the title field is of type text with the IK analyzer, every title value is tokenized, producing the terms 小米, 手机, 华为, 充电器, 手环, and the inverted index table for title becomes:
term | doc ids |
---|---|
小米 | 1,3,4 |
手机 | 1,2 |
华为 | 2,3 |
充电器 | 3 |
手环 | 4 |
As you can see, looking up document ids from a term is the exact reverse of MySQL's row-first lookup, which is what "inverted" means.
1.2 Search
Search process for "华为手机":
- tokenize the query => "华为", "手机"
- look up each term in the inverted index, getting doc ids 2,3 (华为) and 1,2 (手机)
- union them into the result set 1, 2, 3
- fetch documents 1, 2, 3 from the index by id.
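The build-and-search cycle above can be sketched as a plain Java map from term to posting list. This is a toy model, not es internals; the documents are hard-coded here already tokenized, where es would run the IK analyzer:

```java
import java.util.*;

// Toy inverted index: term -> sorted set of document ids (the posting list).
public class InvertedIndexDemo {
    static final Map<String, SortedSet<Integer>> index = new HashMap<>();

    // Index one document: record its id under every term it contains.
    static void addDocument(int docId, List<String> terms) {
        for (String term : terms) {
            index.computeIfAbsent(term, t -> new TreeSet<>()).add(docId);
        }
    }

    // Search: look up each query term and union the posting lists.
    static SortedSet<Integer> search(List<String> queryTerms) {
        SortedSet<Integer> result = new TreeSet<>();
        for (String term : queryTerms) {
            result.addAll(index.getOrDefault(term, Collections.emptySortedSet()));
        }
        return result;
    }

    static {
        // The four example rows, pre-tokenized as the IK analyzer would split them.
        addDocument(1, List.of("小米", "手机"));
        addDocument(2, List.of("华为", "手机"));
        addDocument(3, List.of("华为", "小米", "充电器"));
        addDocument(4, List.of("小米", "手环"));
    }
}
```

Searching for 华为 and 手机 unions the posting lists 2,3 and 1,2 into 1,2,3, exactly the steps listed above.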
2. MySQL vs es
The two are complements, not replacements: each covers the other's weaknesses, and you pick the right one for each kind of data operation.
- MySQL handles transactional operations and guarantees data safety and consistency; primary-key lookups can go straight to SQL.
- es handles search, analysis, and computation over massive data sets.
- For example, a back-office admin system can use MySQL directly,
- while user-facing search should go through es.
3. Mapping Properties When Creating an Index
A mapping constrains the documents in an index. Common mapping properties:
- type: the field's data type. Common simple types:
  - strings: text (analyzed text), keyword (exact value)
  - numeric: long, integer, double, float ...
  - boolean: boolean
  - date: date
  - object: object
- index: whether to build an inverted index for the field; defaults to true.
- analyzer: the analyzer to use: standard, ik_smart, ik_max_word, pinyin ...
- properties: sub-fields
4. Index Operations: DSL and Java RestClient
Test setup
private RestHighLevelClient esClient;
@BeforeEach // create the client
void setUp(){
this.esClient = new RestHighLevelClient(RestClient.builder(HttpHost.create(
"xxx.xx.71.234:9200"
)));
}
@AfterEach // close the client
void closeClient() throws IOException {
this.esClient.close();
}
4.1 Create an Index
DSL
PUT /hotel
{
"settings": {
"analysis": {
"analyzer": {
"text_analyzer":{
"tokenizer":"ik_max_word",
"filter":"py"
},
"completion_analyzer":{
"tokenizer" : "keyword",
"filter":"py"
}
},
"filter": {
"py":{
"type": "pinyin",
"keep_full_pinyin": false,
"keep_joined_full_pinyin": true,
"keep_original": true,
"limit_first_letter_length": 16,
"remove_duplicated_term": true,
"none_chinese_pinyin_tokenize": false
}
}
}
},
"mappings": {
"properties": {
"id":{
"type": "keyword"
},
"name":{
"type": "text",
"analyzer": "text_analyzer",
"search_analyzer": "ik_max_word",
"copy_to": "all"
},
"address":{
"type": "keyword"
},
"price":{
"type": "integer"
},
"score":{
"type": "integer"
},
"brand":{
"type": "keyword",
"copy_to": "all"
},
"city":{
"type": "keyword"
},
"starName":{
"type": "keyword"
},
"business":{
"type": "keyword",
"copy_to": "all"
},
"location":{
"type": "geo_point"
},
"pic":{
"type": "keyword",
"index": false
},
"all":{
"type": "text",
"analyzer": "text_analyzer",
"search_analyzer": "ik_max_word"
},
"suggestion":{
"type": "completion",
"analyzer": "completion_analyzer"
}
}
}
}
4.2 Get an Index
DSL
GET /hotel
Rest Client: check whether the index exists
@Test
void indexExist() throws IOException {
GetIndexRequest getIndexRequest = new GetIndexRequest("hotel");
boolean exists = esClient.indices().exists(getIndexRequest, RequestOptions.DEFAULT);
System.err.println(exists);
}
4.3 Delete an Index
DSL
DELETE /hotel
Rest Client: delete the index
@Test
void deleteIndex() throws IOException {
DeleteIndexRequest deleteIndexRequest = new DeleteIndexRequest("hotel");
esClient.indices().delete(deleteIndexRequest,RequestOptions.DEFAULT);
}
4.4 Add a Field to the Mapping
DSL
PUT /hotel/_mapping
{
"properties":{
"age":{
"type":"integer"
}
}
}
5. Document Operations: DSL and Java RestClient
5.1 Create Documents (Bulk)
DSL
POST /user/_doc/1
{
"id" : 1,
"name" : "奥利给",
"gender" : false,
"email" : "1111111111@qq.com"
}
Rest Client
@Test
void createDocument() throws IOException {
BulkRequest bulkRequest = new BulkRequest();
//user3
User user3 = new User();
user3.setId(3L);
user3.setName("user3");
user3.setEmail("@qq.com");
user3.setGender(false);
//user4
User user4 = new User();
user4.setId(4L);
user4.setName("user4");
user4.setEmail("@qq.com");
user4.setGender(false);
bulkRequest.add(new IndexRequest("user").source(JSON.toJSONString(user3), XContentType.JSON).id("3"));
bulkRequest.add(new IndexRequest("user").source(JSON.toJSONString(user4), XContentType.JSON).id("4"));
esClient.bulk(bulkRequest, RequestOptions.DEFAULT);
}
5.2 Delete a Document
DSL
DELETE /user/_doc/1
Rest Client
@Test
void deleteDocument() throws IOException {
DeleteRequest deleteRequest = new DeleteRequest("user").id("4");
esClient.delete(deleteRequest, RequestOptions.DEFAULT);
}
5.3 Update a Document
DSL
POST /user/_update/1
{
"doc":{
"name" : "the 奥利给 has been updated"
}
}
Rest Client
@Test
void updateDocument() throws IOException {
UpdateRequest updateRequest = new UpdateRequest("user","1");
updateRequest.doc("name","the 奥利给 has been updated");
esClient.update(updateRequest,RequestOptions.DEFAULT);
}
6. Document Queries
6.1 match all
DSL
# match_all: query all documents
GET /hotel/_search
{
"query": {
"match_all": {}
}
}
Rest Client
@Test
void matchAll() throws IOException {
SearchRequest searchRequest = new SearchRequest("hotel");
SearchResponse response =
esClient.search(searchRequest, RequestOptions.DEFAULT);
// parse the results
long value = response.getHits().getTotalHits().value;
System.err.println(value);
SearchHit[] hits = response.getHits().getHits();
for (SearchHit hit : hits) {
System.err.println(hit.getSourceAsString());
}
}
6.2 match field
DSL
GET /hotel/_search
{
"query": {
"match": {
"all": "上海"
}
}
}
Rest Client
@Test
void match() throws IOException {
SearchRequest searchRequest = new SearchRequest("hotel");
searchRequest.source().query(QueryBuilders.matchQuery("all", "上海"));
SearchResponse search =
esClient.search(searchRequest, RequestOptions.DEFAULT);
long value = search.getHits().getTotalHits().value;
System.err.println(value);
}
6.3 multi match
DSL
GET /hotel/_search
{
"query": {
"multi_match": {
"query": "上海",
"fields": [
"name",
"business"
]
}
}
}
Rest Client
@Test
void multiMatch() throws IOException {
SearchRequest searchRequest = new SearchRequest("hotel");
searchRequest.source().query(QueryBuilders.multiMatchQuery("上海", "name","business"));
SearchResponse search =
esClient.search(searchRequest, RequestOptions.DEFAULT);
TotalHits totalHits = search.getHits().getTotalHits();
System.err.println(totalHits);
}
6.4 term
DSL
GET /hotel/_search
{
"query": {
"term": {
"city": {
"value": "上海"
}
}
}
}
Rest Client
@Test
void termSearch() throws IOException {
SearchRequest searchRequest = new SearchRequest("hotel");
searchRequest.source().query(QueryBuilders.termQuery("city", "上海"));
SearchResponse search =
esClient.search(searchRequest, RequestOptions.DEFAULT);
long totalHits = search.getHits().getTotalHits().value;
System.out.println(totalHits);
}
6.5 range
DSL
GET /hotel/_search
{
"query": {
"range": {
"price": {
"gte": 0,
"lte": 2000
}
}
}
}
Rest Client
@Test
void rangeSearch() throws IOException {
SearchRequest searchRequest = new SearchRequest("hotel");
searchRequest.source().query(QueryBuilders.rangeQuery("price").gte(0).lte(2000));
SearchResponse search =
esClient.search(searchRequest, RequestOptions.DEFAULT);
long totalHits = search.getHits().getTotalHits().value;
System.out.println(totalHits);
}
6.6 bool Queries
A bool query combines sub-queries:
- must: "AND"; contributes to the score
- filter: "AND"; does not contribute to the score
- must_not: "NOT"; does not contribute to the score
- should: "OR"; contributes to the score
DSL
GET /hotel/_search
{
"query": {
"bool": {
"must": [
{
"term": {
"city": {
"value": "上海"
}
}
}
],
"filter": [
{
"range": {
"price": {
"gte": 0,
"lte": 2000
}
}
}
]
}
}
}
Rest Client
@Test
void booleanSearch() throws IOException {
SearchRequest searchRequest = new SearchRequest("hotel");
BoolQueryBuilder boolQuery = QueryBuilders.boolQuery();
boolQuery.must(QueryBuilders.termQuery("city", "上海"));
boolQuery.filter(QueryBuilders.rangeQuery("price").gte(0).lte(2000));
searchRequest.source().query(boolQuery);
SearchResponse response =
esClient.search(searchRequest, RequestOptions.DEFAULT);
long value = response.getHits().getTotalHits().value;
System.out.println(value);
}
6.7 sort and page
DSL
GET /hotel/_search
{
"query": {
"match_all": {}
},
"sort": [
{
"price": {
"order": "desc"
}
}
],
"from": 0,
"size": 2
}
Rest Client
@Test
void sortAndPage() throws IOException {
SearchRequest searchRequest = new SearchRequest("hotel");
searchRequest.source().sort("price", SortOrder.DESC).from(0).size(2);
SearchResponse response =
esClient.search(searchRequest, RequestOptions.DEFAULT);
SearchHit[] hits = response.getHits().getHits();
for (SearchHit hit : hits) {
String sourceAsString = hit.getSourceAsString();
System.out.println(sourceAsString);
}
}
6.7.1 Deep Pagination
Paging in es is not cheap: for from = 990, size = 10, es first collects the top 1000 hits and only then slices out the last 10.
Because es is distributed, this becomes the deep-pagination problem. For example, after sorting by price, fetching from = 990, size = 10 works like this:
- each shard sorts its own data and returns its top 1000 documents
- the coordinating node aggregates all shards' results and re-sorts them in memory to pick the global top 1000
- finally, the 10 documents starting at offset 990 are sliced out of those 1000
The deeper the page (the larger from + size), the more memory and CPU this costs, so es caps the result window at 10000 (from + size <= 10000).
Picture a 5000-node cluster: pulling 1000 documents from each shard would mean re-sorting on the order of 5 million documents in memory.
In practice a user-facing paging UI rarely exposes more than 10000 results (usually far fewer), and users rarely page that deep anyway.
If you really do need to read past 10000 documents, es offers two solutions for deep pagination:
- search after: requires a sort; it queries the next page starting from the sort values of the previous page's last hit. Because it resumes from the last recorded sort values, it can only page forward sequentially; it cannot jump to an arbitrary page or go back.
- scroll: takes a snapshot of the result set and iterates through it; suited to batch processing rather than interactive paging.
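A minimal search after sketch against the hotel index above; the sort must be stable, and the search_after values (189 and "57" here are purely illustrative) are the sort values of the previous page's last hit:
GET /hotel/_search
{
"size": 10,
"query": {
"match_all": {}
},
"sort": [
{"price": "asc"},
{"id": "asc"}
],
"search_after": [189, "57"]
}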
6.8 Highlight
By default, the highlighted field must be the same field the search ran against; set require_field_match to false to highlight other fields.
DSL
GET /hotel/_search
{
"query": {
"match": {
"all": "上海"
}
},
"highlight": {
"fields": {
"name": {
"require_field_match": "false"
}
}
}
}
Rest Client
@Test
void highLight() throws IOException {
SearchRequest searchRequest = new SearchRequest("hotel");
searchRequest.source().query(QueryBuilders.matchQuery("all", "上海"));
searchRequest.source().highlighter(
new HighlightBuilder().field("name").requireFieldMatch(false));
SearchResponse response = esClient.search(searchRequest, RequestOptions.DEFAULT);
SearchHit[] hits = response.getHits().getHits();
for (SearchHit hit : hits) {
Map<String, HighlightField> highlightFields = hit.getHighlightFields();
HighlightField name = highlightFields.get("name");
Text[] fragments = name.getFragments();
for (Text fragment : fragments) {
System.out.println(fragment.string());
System.out.println("-----");
}
}
}
6.9 sort by geoDistance
DSL
GET /hotel/_search
{
"query": {
"match_all": {}
},
"sort": [
{
"_geo_distance": {
"location": {
"lat": 31,
"lon": 121
},
"order": "asc"
}
}
]
}
Rest Client
@Test
void sortByGeo() throws IOException {
SearchRequest searchRequest = new SearchRequest("hotel");
searchRequest.source().
sort(SortBuilders.geoDistanceSort("location", new GeoPoint("31,121")).
order(SortOrder.ASC).unit(DistanceUnit.KILOMETERS));
SearchResponse response = esClient.search(searchRequest, RequestOptions.DEFAULT);
SearchHit[] hits = response.getHits().getHits();
for (SearchHit hit : hits) {
String sourceAsString = hit.getSourceAsString();
Object[] sortValues = hit.getSortValues();
for (Object sortValue : sortValues) {
System.err.println(Double.parseDouble(sortValue.toString()));
}
System.out.println(sourceAsString);
}
}
6.10 geo_distance
DSL
GET /hotel/_search
{
"query": {
"geo_distance":{
"distance":"15km",
"location":"31.21,121.5"
}
}
}
6.11 Compound Queries
6.11.1 function score
With a match query, documents are scored by their relevance to the search terms (_score) and returned in descending score order. function_score lets you adjust those scores.
DSL
GET /hotel/_search
{
"query": {
"function_score": {
"query": { // the original query; matching documents get a relevance score (query score)
"match": {
"all": "外滩"
}
},
"functions": [
{
"filter": { // filter; only documents that match it are re-scored
"term": {
"city": "上海"
}
},
"weight": 10 // score function, explained below
}
],
"boost_mode": "multiply" // boost mode, explained below
}
}
}
Score functions: a function's result is called the function score; it is combined with the query score to produce the final score. Common functions:
- weight: a constant value used directly as the function score.
- field_value_factor: use a document field's value as the function score (e.g. paid ranking: whoever pays more ranks higher).
- random_score: a random number.
- script_score: a custom formula whose result is the function score.
Boost mode defines how the function score and the query score are combined:
- multiply: multiply the two (the default)
- replace: replace the query score with the function score
- others: sum, avg, max, min
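The combine rules above amount to a few lines of arithmetic. This mirrors the mode list, not es source code, and the scores passed in are made-up example values:

```java
// How boost_mode combines the query score with the function score.
public class BoostModeDemo {
    static double combine(String mode, double queryScore, double functionScore) {
        switch (mode) {
            case "multiply": return queryScore * functionScore; // the default
            case "replace":  return functionScore;              // drop the query score
            case "sum":      return queryScore + functionScore;
            case "avg":      return (queryScore + functionScore) / 2;
            case "max":      return Math.max(queryScore, functionScore);
            case "min":      return Math.min(queryScore, functionScore);
            default: throw new IllegalArgumentException("unknown mode: " + mode);
        }
    }
}
```

With queryScore = 2 and functionScore (weight) = 10, multiply yields 20 while replace yields 10, which is why an advertised document can jump far up the ranking.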
Rest Client
@Test
void functionTest() throws IOException {
SearchRequest searchRequest = new SearchRequest("hotel");
BoolQueryBuilder boolQuery = QueryBuilders.boolQuery();
FunctionScoreQueryBuilder functionScoreQuery =
QueryBuilders.functionScoreQuery(boolQuery, new FunctionScoreQueryBuilder.FilterFunctionBuilder[]{
new FunctionScoreQueryBuilder.FilterFunctionBuilder(
QueryBuilders.termQuery("isAD", true),
ScoreFunctionBuilders.weightFactorFunction(10)
)
}).boostMode(CombineFunction.SUM);
searchRequest.source().query(functionScoreQuery);
SearchResponse response = esClient.search(searchRequest, RequestOptions.DEFAULT);
}
7. Aggregations
7.1 Types of Aggregations
- Bucket aggregations: group documents.
  - Term aggregation: group by a field value.
  - Date Histogram: group by date steps, e.g. one bucket per week or per month.
- Metric aggregations: compute values such as max, min, avg.
  - Avg: average
  - Max: maximum
  - Min: minimum
  - Stats: max, min, avg, sum, etc. in one go
- Pipeline aggregations: aggregate over the results of other aggregations.
7.2 Bucket Aggregations
DSL
GET /hotel/_search
{
"query":{
"range":{
"price":{
"lte":200
}
}
},
"size": 0,
"aggs": {
"brandAggs": {
"terms": {
"field": "brand",
"size": 10
},
"aggs": {
"scoreAgg": {
"stats": {
"field": "score"
}
}
}
}
}
}
Rest Client
@Test
void brandAgg() throws IOException {
SearchRequest searchRequest = new SearchRequest("hotel");
searchRequest.source().size(0);
//term agg
TermsAggregationBuilder brandTermAgg = AggregationBuilders.terms("brandAgg").field("brand").size(10);
//avg agg
AvgAggregationBuilder scoreAvgAgg = AggregationBuilders.avg("scoreAvg").field("score");
//min agg
MinAggregationBuilder scoreMinAgg = AggregationBuilders.min("scoreMin").field("score");
brandTermAgg.subAggregation(scoreAvgAgg);
brandTermAgg.subAggregation(scoreMinAgg);
searchRequest.source().aggregation(brandTermAgg);
SearchResponse response =
esClient.search(searchRequest, RequestOptions.DEFAULT);
Terms brandTerms = response.getAggregations().get("brandAgg");
List<? extends Terms.Bucket> buckets = brandTerms.getBuckets();
for (Terms.Bucket bucket : buckets) {
String keyAsString = bucket.getKeyAsString();
System.err.println(keyAsString);
Aggregations aggregations = bucket.getAggregations();
Avg scoreAvg = aggregations.get("scoreAvg");
System.out.println("avg: " + scoreAvg.getValue());
Min scoreMin = aggregations.get("scoreMin");
System.out.println("min: " + scoreMin.getValue());
}
}
7.3 Multiple Aggregations in One Request
@Test
void multiAgg() throws IOException {
SearchRequest searchRequest = new SearchRequest("hotel");
searchRequest.source().size(0);
//brand Agg
TermsAggregationBuilder brandAgg =
AggregationBuilders.terms("brandAgg").field("brand").size(10);
//starName agg
TermsAggregationBuilder starNameAgg =
AggregationBuilders.terms("starNameAgg").field("starName").size(10);
//city Agg
TermsAggregationBuilder cityAgg =
AggregationBuilders.terms("cityAgg").field("city").size(10);
searchRequest.source().aggregation(brandAgg);
searchRequest.source().aggregation(starNameAgg);
searchRequest.source().aggregation(cityAgg);
SearchResponse response =
esClient.search(searchRequest, RequestOptions.DEFAULT);
Terms brandTerms = response.getAggregations().get("brandAgg");
List<? extends Terms.Bucket> buckets = brandTerms.getBuckets();
for (Terms.Bucket bucket : buckets) {
String hotelName = bucket.getKeyAsString();
System.err.println(hotelName);
}
Terms starNameTerms = response.getAggregations().get("starNameAgg");
List<? extends Terms.Bucket> buckets1 = starNameTerms.getBuckets();
for (Terms.Bucket bucket : buckets1) {
String hotelName = bucket.getKeyAsString();
System.out.println(hotelName);
}
}
8. Autocomplete
8.1 Analyzers
An es analyzer is composed of three parts:
- character filters: process the text before the tokenizer, e.g. removing or replacing characters.
- tokenizer: splits the text into terms by some rule, e.g. keyword (no splitting at all) or ik_smart.
- filter: post-processes the terms the tokenizer emits, e.g. lowercasing, synonyms, pinyin.
The pinyin filter is appropriate when building the inverted index, but should not be used at search time.
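To see what an analyzer actually produces, the _analyze endpoint can be used (assuming the IK plugin is installed):
POST /_analyze
{
"analyzer": "ik_smart",
"text": "华为手机"
}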
8.2 Syntax
Building the suggestion field:
@Data
@NoArgsConstructor
public class HotelDoc {
private Long id;
private String name;
private String address;
private Integer price;
private Integer score;
private String brand;
private String city;
private String starName;
private String business;
private String location;
private String pic;
private Boolean isAd;
private List<String> suggestion;
public HotelDoc(Hotel hotel) {
this.id = hotel.getId();
this.name = hotel.getName();
this.address = hotel.getAddress();
this.price = hotel.getPrice();
this.score = hotel.getScore();
this.brand = hotel.getBrand();
this.city = hotel.getCity();
this.starName = hotel.getStarName();
this.business = hotel.getBusiness();
this.location = hotel.getLatitude() + ", " + hotel.getLongitude();
this.pic = hotel.getPic();
if(this.business.contains("/")){
String[] arr = this.business.split("/");
// multiple business districts: brand plus each district as its own entry
this.suggestion = new ArrayList<>();
this.suggestion.add(this.brand);
Collections.addAll(this.suggestion , arr);
} else {
this.suggestion = Arrays.asList(this.brand , this.business);
}
}
}
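The suggestion-building branch can be pulled into a standalone helper so both cases are easy to check in isolation; the brand/business values used below are illustrative:

```java
import java.util.*;

// Same logic as HotelDoc's constructor: brand first, then either the single
// business district or each "/"-separated district as its own entry.
public class SuggestionBuilderDemo {
    static List<String> buildSuggestion(String brand, String business) {
        List<String> suggestion = new ArrayList<>();
        suggestion.add(brand);
        if (business.contains("/")) {
            Collections.addAll(suggestion, business.split("/"));
        } else {
            suggestion.add(business);
        }
        return suggestion;
    }
}
```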
DSL
GET /hotel/_search
{
"size": 0,
"suggest": {
"mySuggestion": {
"text": "sd",
"completion": {
"field": "suggestion",
"skip_duplicates":true,
"size":2
}
}
}
}
Rest Client
@Test
void hotelSuggestion() throws IOException {
SearchRequest searchRequest = new SearchRequest("hotel");
searchRequest.source().suggest(new SuggestBuilder().
addSuggestion("mySuggestion",
SuggestBuilders.completionSuggestion("suggestion").prefix("sd").skipDuplicates(true).size(10)));
SearchResponse response =
esClient.search(searchRequest, RequestOptions.DEFAULT);
CompletionSuggestion mySuggestion =
response.getSuggest().getSuggestion("mySuggestion");
List<CompletionSuggestion.Entry.Option> options = mySuggestion.getOptions();
for (CompletionSuggestion.Entry.Option option : options) {
System.err.println(option.getText().string());
}
}
9. ES Clusters
9.1 Cluster Structure
A single-node es deployment faces two problems: storing massive amounts of data, and being a single point of failure.
- Massive data: logically split the index into N shards (shard) stored across multiple nodes.
- Single point of failure: replicate each shard's data onto different nodes.
9.2 Node Roles
Node type | Config | Default | Responsibility |
---|---|---|---|
master eligible | node.master | true | Master-eligible node: the elected master manages and records cluster state, decides which node each shard lives on, and handles index create/delete requests. |
data | node.data | true | Data node: stores data; search, aggregation, CRUD. |
ingest | node.ingest | true | Pre-processes data before it is stored. |
coordinating | all three parameters above set to false | n/a | Routes requests to other nodes, merges their results, and returns them to the user. |
Each role has distinct responsibilities, so when deploying a cluster it is recommended to give every node a single dedicated role.
9.3 Split Brain
A network partition can leave the cluster with multiple masters ("split brain"); since es 7.0 the default discovery settings resolve this.
9.4 ES分布式新增流程
当新增文档时,应该保存到不同分片,保证数据均衡,那么coordinating node如何确定数据该存储到哪个分片呢?elasticsearch会通过hash算法来计算文档应该存储到哪个分片.
shard = hash(_routing) % number_of_shards
-
_routing默认是文档的id
-
算法与分片数量有关,因此索引库一旦创建,分片的数量就不能修改了!
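The routing formula can be sketched in a few lines; Java's String.hashCode stands in here for the murmur3 hash es actually uses:

```java
// shard = hash(_routing) % number_of_shards, kept non-negative with floorMod
// since Java hash codes can be negative.
public class ShardRoutingDemo {
    static int shardFor(String routing, int numberOfShards) {
        return Math.floorMod(routing.hashCode(), numberOfShards);
    }
}
```

Because the result depends on numberOfShards, changing the shard count would re-route existing documents to different shards, which is exactly why the count is fixed at index creation.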
9.5 ES分布式查询流程
elasticsearch的查询分成两个阶段
-
scatter phase:分散阶段,coordinating node会把请求分发到每一个分片
-
gather phase:聚集阶段,coordinating node汇总data node的搜索结果,并处理为最终结果集返回给用户
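The gather phase amounts to merging per-shard top-k lists and re-selecting the global top-k. A sketch, sorted ascending as for a price-ascending sort; the per-shard results below are example data:

```java
import java.util.*;
import java.util.stream.Collectors;

// Coordinating-node merge: flatten each shard's sorted top-k candidates,
// re-sort them globally, and keep only the first k.
public class ScatterGatherDemo {
    static List<Integer> gatherTopK(List<List<Integer>> perShardTopK, int k) {
        return perShardTopK.stream()
                .flatMap(List::stream)
                .sorted()
                .limit(k)
                .collect(Collectors.toList());
    }
}
```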
9.6 Failover
The master monitors the state of the cluster's nodes; if it detects that a node has gone down, it immediately migrates that node's shard data to other nodes to keep the data safe. This is called failover.