Elasticsearch Core and Practice: Day 02
I. Inverted Indexes
1. Search engine indexes
- Forward index: maps a document ID to the document's content and terms
- Inverted index: maps a term to the IDs of the documents that contain it
2. Core components of an inverted index
An inverted index consists of two parts:
- Term Dictionary: records every term in the corpus and maps each term to its posting list. The term dictionary is usually large; it can be implemented with a B+ tree or a hash table with chaining to support high-performance inserts and lookups.
- Posting List: records the set of documents that contain a term; made up of posting entries
- Each posting entry (Posting) records:
1. Document ID
2. Term frequency (TF): how many times the term appears in the document, used for relevance scoring
3. Position: the term's token position within the document, used for phrase queries
4. Offset: the start and end character offsets of the term, used for highlighting
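As a conceptual sketch (illustrative field names, not Elasticsearch's actual on-disk format), the posting list for a term that appears in two documents might look like:

```text
"elasticsearch" -> [
  { doc_id: 1, tf: 2, positions: [0, 12], offsets: [[0, 13], [40, 53]] },  // appears twice in doc 1
  { doc_id: 3, tf: 1, positions: [4],     offsets: [[20, 33]] }            // appears once in doc 3
]
```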
II. Tokenization with Analyzers
GET _analyze
{
  "analyzer": "standard",
  "text": "2 running Quick brown-foxes leap over lazy dogs in the summer evening."
}
1. Standard Analyzer
- The default analyzer
- Splits on word boundaries
- Lowercases tokens
Returned result:
{
"tokens" : [
{
"token" : "2", // the token text
"start_offset" : 0, // where the token starts in the input
"end_offset" : 1, // where the token ends
"type" : "<NUM>", // token type
"position" : 0 // the token's position in the stream
},
{
"token" : "running",
"start_offset" : 2,
"end_offset" : 9,
"type" : "<ALPHANUM>",
"position" : 1
},
{
"token" : "quick",
"start_offset" : 10,
"end_offset" : 15,
"type" : "<ALPHANUM>",
"position" : 2
},
{
"token" : "brown",
"start_offset" : 16,
"end_offset" : 21,
"type" : "<ALPHANUM>",
"position" : 3
},
{
"token" : "foxes",
"start_offset" : 22,
"end_offset" : 27,
"type" : "<ALPHANUM>",
"position" : 4
},
{
"token" : "leap",
"start_offset" : 28,
"end_offset" : 32,
"type" : "<ALPHANUM>",
"position" : 5
},
{
"token" : "over",
"start_offset" : 33,
"end_offset" : 37,
"type" : "<ALPHANUM>",
"position" : 6
},
{
"token" : "lazy",
"start_offset" : 38,
"end_offset" : 42,
"type" : "<ALPHANUM>",
"position" : 7
},
{
"token" : "dogs",
"start_offset" : 43,
"end_offset" : 47,
"type" : "<ALPHANUM>",
"position" : 8
},
{
"token" : "in",
"start_offset" : 48,
"end_offset" : 50,
"type" : "<ALPHANUM>",
"position" : 9
},
{
"token" : "the",
"start_offset" : 51,
"end_offset" : 54,
"type" : "<ALPHANUM>",
"position" : 10
},
{
"token" : "summer",
"start_offset" : 55,
"end_offset" : 61,
"type" : "<ALPHANUM>",
"position" : 11
},
{
"token" : "evening",
"start_offset" : 62,
"end_offset" : 69,
"type" : "<ALPHANUM>",
"position" : 12
}
]
}
2. Simple Analyzer
- Splits on non-letter characters
- Non-letter characters are discarded
- Lowercases tokens
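The request differs from the one above only in the analyzer name:

```json
GET _analyze
{
  "analyzer": "simple",
  "text": "2 running Quick brown-foxes leap over lazy dogs in the summer evening."
}
```

Note that the digit "2" is dropped entirely, since it is not a letter.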
Returned result:
{
"tokens" : [
{
"token" : "running",
"start_offset" : 2,
"end_offset" : 9,
"type" : "word",
"position" : 0
},
{
"token" : "quick",
"start_offset" : 10,
"end_offset" : 15,
"type" : "word",
"position" : 1
},
{
"token" : "brown",
"start_offset" : 16,
"end_offset" : 21,
"type" : "word",
"position" : 2
},
{
"token" : "foxes",
"start_offset" : 22,
"end_offset" : 27,
"type" : "word",
"position" : 3
},
{
"token" : "leap",
"start_offset" : 28,
"end_offset" : 32,
"type" : "word",
"position" : 4
},
{
"token" : "over",
"start_offset" : 33,
"end_offset" : 37,
"type" : "word",
"position" : 5
},
{
"token" : "lazy",
"start_offset" : 38,
"end_offset" : 42,
"type" : "word",
"position" : 6
},
{
"token" : "dogs",
"start_offset" : 43,
"end_offset" : 47,
"type" : "word",
"position" : 7
},
{
"token" : "in",
"start_offset" : 48,
"end_offset" : 50,
"type" : "word",
"position" : 8
},
{
"token" : "the",
"start_offset" : 51,
"end_offset" : 54,
"type" : "word",
"position" : 9
},
{
"token" : "summer",
"start_offset" : 55,
"end_offset" : 61,
"type" : "word",
"position" : 10
},
{
"token" : "evening",
"start_offset" : 62,
"end_offset" : 69,
"type" : "word",
"position" : 11
}
]
}
3. Whitespace Analyzer
- Splits on whitespace only
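Again, only the analyzer name changes:

```json
GET _analyze
{
  "analyzer": "whitespace",
  "text": "2 running Quick brown-foxes leap over lazy dogs in the summer evening."
}
```

Here "brown-foxes" and "evening." survive intact, and the original case of "Quick" is preserved.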
Returned result:
{
"tokens" : [
{
"token" : "2",
"start_offset" : 0,
"end_offset" : 1,
"type" : "word",
"position" : 0
},
{
"token" : "running",
"start_offset" : 2,
"end_offset" : 9,
"type" : "word",
"position" : 1
},
{
"token" : "Quick",
"start_offset" : 10,
"end_offset" : 15,
"type" : "word",
"position" : 2
},
{
"token" : "brown-foxes",
"start_offset" : 16,
"end_offset" : 27,
"type" : "word",
"position" : 3
},
{
"token" : "leap",
"start_offset" : 28,
"end_offset" : 32,
"type" : "word",
"position" : 4
},
{
"token" : "over",
"start_offset" : 33,
"end_offset" : 37,
"type" : "word",
"position" : 5
},
{
"token" : "lazy",
"start_offset" : 38,
"end_offset" : 42,
"type" : "word",
"position" : 6
},
{
"token" : "dogs",
"start_offset" : 43,
"end_offset" : 47,
"type" : "word",
"position" : 7
},
{
"token" : "in",
"start_offset" : 48,
"end_offset" : 50,
"type" : "word",
"position" : 8
},
{
"token" : "the",
"start_offset" : 51,
"end_offset" : 54,
"type" : "word",
"position" : 9
},
{
"token" : "summer",
"start_offset" : 55,
"end_offset" : 61,
"type" : "word",
"position" : 10
},
{
"token" : "evening.",
"start_offset" : 62,
"end_offset" : 70,
"type" : "word",
"position" : 11
}
]
}
4. Stop Analyzer
- Like the Simple Analyzer, but with a stop token filter added
- Removes stop words such as "the", "a", and "is"
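The corresponding request:

```json
GET _analyze
{
  "analyzer": "stop",
  "text": "2 running Quick brown-foxes leap over lazy dogs in the summer evening."
}
```

"in" and "the" are dropped, but they still consume position increments, which is why "summer" sits at position 10 rather than 8.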
Returned result:
{
"tokens" : [
{
"token" : "running",
"start_offset" : 2,
"end_offset" : 9,
"type" : "word",
"position" : 0
},
{
"token" : "quick",
"start_offset" : 10,
"end_offset" : 15,
"type" : "word",
"position" : 1
},
{
"token" : "brown",
"start_offset" : 16,
"end_offset" : 21,
"type" : "word",
"position" : 2
},
{
"token" : "foxes",
"start_offset" : 22,
"end_offset" : 27,
"type" : "word",
"position" : 3
},
{
"token" : "leap",
"start_offset" : 28,
"end_offset" : 32,
"type" : "word",
"position" : 4
},
{
"token" : "over",
"start_offset" : 33,
"end_offset" : 37,
"type" : "word",
"position" : 5
},
{
"token" : "lazy",
"start_offset" : 38,
"end_offset" : 42,
"type" : "word",
"position" : 6
},
{
"token" : "dogs",
"start_offset" : 43,
"end_offset" : 47,
"type" : "word",
"position" : 7
},
{
"token" : "summer",
"start_offset" : 55,
"end_offset" : 61,
"type" : "word",
"position" : 10
},
{
"token" : "evening",
"start_offset" : 62,
"end_offset" : 69,
"type" : "word",
"position" : 11
}
]
}
5. Keyword Analyzer
- Does not tokenize; emits the entire input as a single term
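For example:

```json
GET _analyze
{
  "analyzer": "keyword",
  "text": "2 running Quick brown-foxes leap over lazy dogs in the summer evening."
}
```

This returns one token containing the whole sentence, unchanged.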
6. Pattern Analyzer
- Tokenizes with a regular expression
- The default pattern is \W+, i.e. split on runs of non-word characters
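Running it against the same text:

```json
GET _analyze
{
  "analyzer": "pattern",
  "text": "2 running Quick brown-foxes leap over lazy dogs in the summer evening."
}
```

With the default \W+ pattern, "brown-foxes" splits into "brown" and "foxes", and tokens are lowercased.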
7. English Analyzer
- Tailored for English text: removes English stop words and stems tokens (for example, "running" becomes "run" and "foxes" becomes "fox")
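To see the stemming in action:

```json
GET _analyze
{
  "analyzer": "english",
  "text": "2 running Quick brown-foxes leap over lazy dogs in the summer evening."
}
```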
III. The Search API and URI Search
1. URI Search
- Pass the query as query parameters in the URI
2. Request Body Search
- Use the more expressive, JSON-based Query Domain Specific Language (DSL) that Elasticsearch provides
3. Search relevance
- Search is a conversation between the user and the search engine
- Users care about the relevance of the results:
- Can all the relevant content be found?
- How much irrelevant content is returned?
- Are the document scores reasonable?
- Balance result ranking against business requirements
The PageRank algorithm
- Looks beyond the content itself
- What matters more is the content's credibility
4. Measuring relevance
- Information Retrieval metrics:
- Precision: return as few irrelevant documents as possible
- Recall: return as many of the relevant documents as possible
- Ranking: can the results be ordered by relevance?
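In standard Information Retrieval notation, writing TP for relevant documents that were returned, FP for irrelevant documents returned, and FN for relevant documents that were missed:

```text
Precision = TP / (TP + FP)   (how clean the returned results are)
Recall    = TP / (TP + FN)   (how complete the returned results are)
```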
5. URI Search examples
a. Specifying a field
Find documents whose title field matches 2012:
GET /movies/_search?q=2012&df=title
{
"profile": "true"
}
b. Generic query
Find documents where any field matches 2012:
GET /movies/_search?q=2012
{
"profile": "true"
}
c. Term and Phrase
- Beautiful Mind is equivalent to Beautiful OR Mind
- "Beautiful Mind" is equivalent to Beautiful AND Mind; as a Phrase query it also requires the terms to appear in that order
// phrase query: quote the terms
GET /movies/_search?q=title:"Beautiful Mind"
{
"profile": "true"
}
d. Grouped queries
// grouping: a bool query
GET /movies/_search?q=title:(Beautiful Mind)
{
"profile": "true"
}
Must contain both Beautiful and Mind:
// find "A Beautiful Mind"
GET /movies/_search?q=title:(Beautiful AND Mind)
{
"profile": "true"
}
// %2B is the URL-encoded +; "+Mind" marks Mind as a required term
GET /movies/_search?q=title:(Beautiful %2BMind)
{
"profile": "true"
}
Must contain Beautiful but not Mind:
// Beautiful without Mind
GET /movies/_search?q=title:(Beautiful NOT Mind)
{
"profile": "true"
}
e. Range queries
Year 1980 or later:
// range query: bracket notation or math notation
GET /movies/_search?q=year:>=1980
{
"profile": "true"
}
f. Wildcard queries
- ? matches exactly one character; * matches zero or more characters
title:mi?d
title:be*
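Following the URI Search convention used above (same movies index as the earlier examples), a full wildcard request might look like:

```json
GET /movies/_search?q=title:b*
{
  "profile": "true"
}
```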
IV. Request Body Search, Query DSL, Query String & Simple Query String
1. Request Body Search
- Send the query to Elasticsearch in the HTTP request body
- Uses the Query DSL
2. Query expression: Match
POST /movies/_search
{
"query": {
"match": {
"title": "Last Christmas"
}
}
}
POST /movies/_search
{
"query": {
"match": {
"title":{
"query": "Last Christmas",
"operator": "and"
}
}
}
}
3. Phrase search: Match Phrase
POST /movies/_search
{
"query": {
"match_phrase": {
"title":{
// the terms must appear in this order
"query": "one love",
// slop: how many other tokens may appear between them
"slop": 1
}
}
}
}
4. Simple Query String Query
- Similar to the Query String query, but ignores invalid syntax and supports only a subset of the query syntax
- Does not support AND, OR, NOT as keywords; they are treated as plain terms
- The default relation between terms is OR; a different default operator can be specified
- Supports a limited logical syntax:
- + in place of AND
- | in place of OR
- - in place of NOT
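A request-body sketch (the title field follows the earlier movies examples):

```json
POST /movies/_search
{
  "query": {
    "simple_query_string": {
      "query": "Beautiful +Mind",
      "fields": ["title"],
      "default_operator": "OR"
    }
  }
}
```

Here +Mind makes Mind required, while Beautiful remains optional under the OR default.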