Elasticsearch Core and Practice: Day 02
I. Inverted Indexes
1. Search engine indexes
- Forward index: maps a document ID to the document's content and terms
- Inverted index: maps a term to the IDs of the documents that contain it
2. Core components of an inverted index
An inverted index consists of two parts:
- Term Dictionary: records every term in the corpus and maps each term to its posting list. The term dictionary is usually large; it can be implemented with a B+ tree or a hash table with chaining to support high-performance inserts and lookups.
- Posting List: records the set of documents that contain a term; made up of posting entries
- Each posting entry (Posting) records:
1. Document ID
2. Term frequency (TF): how many times the term appears in the document, used for relevance scoring
3. Position: the term's token position within the document, used for phrase queries
4. Offset: the start and end character offsets of the term, used for highlighting
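As a conceptual sketch (illustrative field names, not Elasticsearch's actual on-disk format), the posting list for a term that appears in two documents might look like:

```text
"elasticsearch" -> [
  { doc_id: 1, tf: 2, positions: [0, 12], offsets: [[0, 13], [40, 53]] },  // appears twice in doc 1
  { doc_id: 3, tf: 1, positions: [4],     offsets: [[20, 33]] }            // appears once in doc 3
]
```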
II. Tokenization with Analyzers
GET _analyze
{
  "analyzer": "standard",
  "text": "2 running Quick brown-foxes leap over lazy dogs in the summer evening."
}
1. Standard Analyzer
- The default analyzer
- Splits on word boundaries
- Lowercases tokens
Returned result:
{
"tokens" : [
{
"token" : "2", // the token text
"start_offset" : 0, // where the token starts in the input
"end_offset" : 1, // where the token ends
"type" : "<NUM>", // token type
"position" : 0 // the token's position in the stream
},
{
"token" : "running",
"start_offset" : 2,
"end_offset" : 9,
"type" : "<ALPHANUM>",
"position" : 1
},
{
"token" : "quick",
"start_offset" : 10,
"end_offset" : 15,
"type" : "<ALPHANUM>",
"position" : 2
},
{
"token" : "brown",
"start_offset" : 16,
"end_offset" : 21,
"type" : "<ALPHANUM>",
"position" : 3
},
{
"token" : "foxes",
"start_offset" : 22,
"end_offset" : 27,
"type" : "<ALPHANUM>",
"position" : 4
},
{
"token" : "leap",
"start_offset" : 28,
"end_offset" : 32,
"type" : "<ALPHANUM>",
"position" : 5
},
{
"token" : "over",
"start_offset" : 33,
"end_offset" : 37,
"type" : "<ALPHANUM>",
"position" : 6
},
{
"token" : "lazy",
"start_offset" : 38,
"end_offset" : 42,
"type" : "<ALPHANUM>",
"position" : 7
},
{
"token" : "dogs",
"start_offset" : 43,
"end_offset" : 47,
"type" : "<ALPHANUM>",
"position" : 8
},
{
"token" : "in",
"start_offset" : 48,
"end_offset" : 50,
"type" : "<ALPHANUM>",
"position" : 9
},
{
"token" : "the",
"start_offset" : 51,
"end_offset" : 54,
"type" : "<ALPHANUM>",
"position" : 10
},
{
"token" : "summer",
"start_offset" : 55,
"end_offset" : 61,
"type" : "<ALPHANUM>",
"position" : 11
},
{
"token" : "evening",
"start_offset" : 62,
"end_offset" : 69,
"type" : "<ALPHANUM>",
"position" : 12
}
]
}
2. Simple Analyzer
- Splits on non-letter characters
- Non-letter characters are discarded
- Lowercases tokens
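The request differs from the one above only in the analyzer name:

```json
GET _analyze
{
  "analyzer": "simple",
  "text": "2 running Quick brown-foxes leap over lazy dogs in the summer evening."
}
```

Note that the digit "2" is dropped entirely, since it is not a letter.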
Returned result:
{
"tokens" : [
{
"token" : "running",
"start_offset" : 2,
"end_offset" : 9,
"type" : "word",
"position" : 0
},
{
"token" : "quick",
"start_offset" : 10,
"end_offset" : 15,
"type" : "word",
"position" : 1
},
{
"token" : "brown",
"start_offset" : 16,
"end_offset" : 21,
"type" : "word",
"position" : 2
},
{
"token" : "foxes",
"start_offset" : 22,
"end_offset" : 27,
"type" : "word",
"position" : 3
},
{
"token" : "leap",
"start_offset" : 28,
"end_offset" : 32,
"type" : "word",
"position" : 4
},
{
"token" : "over",
"start_offset" : 33,
"end_offset" : 37,
"type" : "word",
"position" : 5
},
{
"token" : "lazy",
"start_offset" : 38,
"end_offset" : 42,
"type" : "word",
"position" : 6
},
{
"token" : "dogs",
"start_offset" : 43,
"end_offset" : 47,
"type" : "word",
"position" : 7
},
{
"token" : "in",
"start_offset" : 48,
"end_offset" : 50,
"type" : "word",
"position" : 8
},
{
"token" : "the",
"start_offset" : 51,
"end_offset" : 54,
"type" : "word",
"position" : 9
},
{
"token" : "summer",
"start_offset" : 55,
"end_offset" : 61,
"type" : "word",
"position" : 10
},
{
"token" : "evening",
"start_offset" : 62,
"end_offset" : 69,
"type" : "word",
"position" : 11
}
]
}
3. Whitespace Analyzer
- Splits on whitespace only
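Again, only the analyzer name changes:

```json
GET _analyze
{
  "analyzer": "whitespace",
  "text": "2 running Quick brown-foxes leap over lazy dogs in the summer evening."
}
```

Here "brown-foxes" and "evening." survive intact, and the original case of "Quick" is preserved.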
Returned result:
{
"tokens" : [
{
"token" : "2",
"start_offset" : 0,
"end_offset" : 1,
"type" : "word",
"position" : 0
},
{
"token" : "running",
"start_offset" : 2,
"end_offset" : 9,
"type" : "word",
"position" : 1
},
{
"token" : "Quick",
"start_offset" : 10,
"end_offset" : 15,
"type" : "word",
"position" : 2
},
{
"token" : "brown-foxes",
"start_offset" : 16,
"end_offset" : 27,
"type" : "word",
"position" : 3
},
{
"token" : "leap",
"start_offset" : 28,
"end_offset" : 32,
"type" : "word",
"position" : 4
},
{
"token" : "over",
"start_offset" : 33,
"end_offset" : 37,
"type" : "word",
"position" : 5
},
{
"token" : "lazy",
"start_offset" : 38,
"end_offset" : 42,
"type" : "word",
"position" : 6
},
{
"token" : "dogs",
"start_offset" : 43,
"end_offset" : 47,
"type" : "word",
"position" : 7
},
{
"token" : "in",
"start_offset" : 48,
"end_offset" : 50,
"type" : "word",
"position" : 8
},
{
"token" : "the",
"start_offset" : 51,
"end_offset" : 54,
"type" : "word",
"position" : 9
},
{
"token" : "summer",
"start_offset" : 55,
"end_offset" : 61,
"type" : "word",
"position" : 10
},
{
"token" : "evening.",
"start_offset" : 62,
"end_offset" : 70,
"type" : "word",
"position" : 11
}
]
}
4. Stop Analyzer
- Like the Simple Analyzer, but with a stop token filter added
- Removes stop words such as "the", "a", and "is"
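The corresponding request:

```json
GET _analyze
{
  "analyzer": "stop",
  "text": "2 running Quick brown-foxes leap over lazy dogs in the summer evening."
}
```

"in" and "the" are dropped, but they still consume position increments, which is why "summer" sits at position 10 rather than 8.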
Returned result:
{
"tokens" : [
{
"token" : "running",
"start_offset" : 2,
"end_offset" : 9,
"type" : "word",
"position" : 0
},
{
"token" : "quick",
"start_offset" : 10,
"end_offset" : 15,
"type" : "word",
"position" : 1
},
{
"token" : "brown",
"start_offset" : 16,
"end_offset" : 21,
"type" : "word",
"position" : 2
},
{
"token" : "foxes",
"start_offset" : 22,
"end_offset" : 27,
"type" : "word",
"position" : 3
},
{
"token" : "leap",
"start_offset" : 28,
"end_offset" : 32,
"type" : "word",
"position" : 4
},
{
"token" : "over",
"start_offset" : 33,
"end_offset" : 37,
"type" : "word",
"position" : 5
},
{
"token" : "lazy",
"start_offset" : 38,
"end_offset" : 42,
"type" : "word",
"position" : 6
},
{
"token" : "dogs",
"start_offset" : 43,
"end_offset" : 47,
"type" : "word",
"position" : 7
},
{
"token" : "summer",
"start_offset" : 55,
"end_offset" : 61,
"type" : "word",
"position" : 10
},
{
"token" : "evening",
"start_offset" : 62,
"end_offset" : 69,
"type" : "word",
"position" : 11
}
]
}
5. Keyword Analyzer
- Does not tokenize; emits the entire input as a single term
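For example:

```json
GET _analyze
{
  "analyzer": "keyword",
  "text": "2 running Quick brown-foxes leap over lazy dogs in the summer evening."
}
```

This returns one token containing the whole sentence, unchanged.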
6. Pattern Analyzer
- Tokenizes with a regular expression
- The default pattern is \W+, i.e. split on runs of non-word characters
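Running it against the same text:

```json
GET _analyze
{
  "analyzer": "pattern",
  "text": "2 running Quick brown-foxes leap over lazy dogs in the summer evening."
}
```

With the default \W+ pattern, "brown-foxes" splits into "brown" and "foxes", and tokens are lowercased.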
7. English Analyzer
- Tailored for English text: removes English stop words and stems tokens (for example, "running" becomes "run" and "foxes" becomes "fox")
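To see the stemming in action:

```json
GET _analyze
{
  "analyzer": "english",
  "text": "2 running Quick brown-foxes leap over lazy dogs in the summer evening."
}
```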
III. The Search API and URI Search
1. URI Search
- Pass the query as query parameters in the URI
2. Request Body Search
- Use the more expressive, JSON-based Query Domain Specific Language (DSL) that Elasticsearch provides
3. Search relevance
- Search is a conversation between the user and the search engine
- Users care about the relevance of the results:
- Can all the relevant content be found?
- How much irrelevant content is returned?
- Are the document scores reasonable?
- Balance result ranking against business requirements
The PageRank algorithm
- Looks beyond the content itself
- What matters more is the content's credibility
4. Measuring relevance
- Information Retrieval metrics:
- Precision: return as few irrelevant documents as possible
- Recall: return as many of the relevant documents as possible
- Ranking: can the results be ordered by relevance?
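In standard Information Retrieval notation, writing TP for relevant documents that were returned, FP for irrelevant documents returned, and FN for relevant documents that were missed:

```text
Precision = TP / (TP + FP)   (how clean the returned results are)
Recall    = TP / (TP + FN)   (how complete the returned results are)
```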
5. URI Search examples
a. Specifying a field
Find documents whose title field matches 2012:
GET /movies/_search?q=2012&df=title
{
"profile": "true"
}
b. Generic query
Find documents where any field matches 2012:
GET /movies/_search?q=2012
{
"profile": "true"
}
c. Term and Phrase
- Beautiful Mind is equivalent to Beautiful OR Mind
- "Beautiful Mind" is equivalent to Beautiful AND Mind; as a Phrase query it also requires the terms to appear in that order
// phrase query: quote the terms
GET /movies/_search?q=title:"Beautiful Mind"
{
"profile": "true"
}
d. Grouped queries
// grouping: a bool query
GET /movies/_search?q=title:(Beautiful Mind)
{
"profile": "true"
}
Must contain both Beautiful and Mind:
// find "A Beautiful Mind"
GET /movies/_search?q=title:(Beautiful AND Mind)
{
"profile": "true"
}
// %2B is the URL-encoded +; "+Mind" marks Mind as a required term
GET /movies/_search?q=title:(Beautiful %2BMind)
{
"profile": "true"
}
Must contain Beautiful but not Mind:
// Beautiful without Mind
GET /movies/_search?q=title:(Beautiful NOT Mind)
{
"profile": "true"
}
e. Range queries
Year 1980 or later:
// range query: bracket notation or math notation
GET /movies/_search?q=year:>=1980
{
"profile": "true"
}
f. Wildcard queries
- ? matches exactly one character; * matches zero or more characters
title:mi?d
title:be*
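Following the URI Search convention used above (same movies index as the earlier examples), a full wildcard request might look like:

```json
GET /movies/_search?q=title:b*
{
  "profile": "true"
}
```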
IV. Request Body Search, Query DSL, Query String & Simple Query String
1. Request Body Search
- Send the query to Elasticsearch in the HTTP request body
- Uses the Query DSL
2. Query expression: Match
POST /movies/_search
{
"query": {
"match": {
"title": "Last Christmas"
}
}
}
POST /movies/_search
{
"query": {
"match": {
"title":{
"query": "Last Christmas",
"operator": "and"
}
}
}
}
3. Phrase search: Match Phrase
POST /movies/_search
{
"query": {
"match_phrase": {
"title":{
// the terms must appear in this order
"query": "one love",
// slop: how many other tokens may appear between them
"slop": 1
}
}
}
}
4. Simple Query String Query
- Similar to the Query String query, but ignores invalid syntax and supports only a subset of the query syntax
- Does not support AND, OR, NOT as keywords; they are treated as plain terms
- The default relation between terms is OR; a different default operator can be specified
- Supports a limited logical syntax:
- + in place of AND
- | in place of OR
- - in place of NOT
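A request-body sketch (the title field follows the earlier movies examples):

```json
POST /movies/_search
{
  "query": {
    "simple_query_string": {
      "query": "Beautiful +Mind",
      "fields": ["title"],
      "default_operator": "OR"
    }
  }
}
```

Here +Mind makes Mind required, while Beautiful remains optional under the OR default.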