目录
启用 eager global ordinals 提升高基数聚合性能
1. ES分词器详解
1.1 基本概念
分词器官方称之为文本分析器,顾名思义,是对文本进行分析处理的一种手段,基本处理逻辑为按照预先制定的分词规则,把原始文档分割成若干更小粒度的词项,粒度大小取决于分词器规则。
1.2 分词发生时期
分词器的处理过程发生在 Index Time 和 Search Time 两个时期。
- Index Time:文档写入并创建倒排索引时期,其分词逻辑取决于映射参数analyzer。
- Search Time:搜索发生时期,其分词仅对搜索词产生作用。
1.3 分词器的组成
- 切词器(Tokenizer):用于定义切词(分词)逻辑
- 词项过滤器(Token Filter):用于对分词之后的单个词项的处理逻辑
- 字符过滤器(Character Filter):用于处理单个字符
注意:
- 分词器不会对源数据造成任何影响,分词仅仅是对倒排索引或者搜索词的行为。
切词器:Tokenizer
tokenizer 是分词器的核心组成部分之一,其主要作用是分词,或称之为切词。主要用来对原始文本进行细粒度拆分。拆分之后的每一个部分称之为一个 Term,或称之为一个词项。可以把切词器理解为预定义的切词规则。官方内置了很多种切词器,默认的切词器为 standard。
词项过滤器:Token Filter
词项过滤器用来处理切词完成之后的词项,例如把大小写转换,删除停用词或同义词处理等。官方同样预置了很多词项过滤器,基本可以满足日常开发的需要。当然也是支持第三方也自行开发的。
GET _analyze
{
"filter" : ["lowercase"],
"text" : "WWW ELASTIC ORG CN"
}
GET _analyze
{
"tokenizer" : "standard",
"filter" : ["uppercase"],
"text" : ["www.elastic.org.cn","www elastic org cn"]
}
运行结果
{
"tokens" : [
{
"token" : "www elastic org cn",
"start_offset" : 0,
"end_offset" : 18,
"type" : "word",
"position" : 0
}
]
}
{
"tokens" : [
{
"token" : "WWW.ELASTIC.ORG.CN",
"start_offset" : 0,
"end_offset" : 18,
"type" : "<ALPHANUM>",
"position" : 0
},
{
"token" : "WWW",
"start_offset" : 19,
"end_offset" : 22,
"type" : "<ALPHANUM>",
"position" : 101
},
{
"token" : "ELASTIC",
"start_offset" : 23,
"end_offset" : 30,
"type" : "<ALPHANUM>",
"position" : 102
},
{
"token" : "ORG",
"start_offset" : 31,
"end_offset" : 34,
"type" : "<ALPHANUM>",
"position" : 103
},
{
"token" : "CN",
"start_offset" : 35,
"end_offset" : 37,
"type" : "<ALPHANUM>",
"position" : 104
}
]
}
停用词
在切词完成之后,会被干掉词项,即停用词。停用词可以自定义
英文停用词(english):a, an, and, are, as, at, be, but, by, for, if, in, into, is, it, no, not, of, on, or, such, that, the, their, then, there, these, they, this, to, was, will, with。
中日韩停用词(cjk):a, and, are, as, at, be, but, by, for, if, in, into, is, it, no, not, of, on, or, s, such, t, that, the, their, then, there, these, they, this, to, was, will, with, www。
GET _analyze
{
"tokenizer": "standard",
"filter": ["stop"],
"text": ["What are you doing"]
}
### 自定义 filter
DELETE test_token_filter_stop
PUT test_token_filter_stop
{
"settings": {
"analysis": {
"filter": {
"my_filter": {
"type": "stop",
"stopwords": [
"www"
],
"ignore_case": true
}
}
}
}
}
GET test_token_filter_stop/_analyze
{
"tokenizer": "standard",
"filter": ["my_filter"],
"text": ["What www WWW are you doing"]
}
运行结果
第一个GET
{
"tokens" : [
{
"token" : "What",
"start_offset" : 0,
"end_offset" : 4,
"type" : "<ALPHANUM>",
"position" : 0
},
{
"token" : "you",
"start_offset" : 9,
"end_offset" : 12,
"type" : "<ALPHANUM>",
"position" : 2
},
{
"token" : "doing",
"start_offset" : 13,
"end_offset" : 18,
"type" : "<ALPHANUM>",
"position" : 3
}
]
}
第二个GET
{
"tokens" : [
{
"token" : "What",
"start_offset" : 0,
"end_offset" : 4,
"type" : "<ALPHANUM>",
"position" : 0
},
{
"token" : "are",
"start_offset" : 13,
"end_offset" : 16,
"type" : "<ALPHANUM>",
"position" : 3
},
{
"token" : "you",
"start_offset" : 17,
"end_offset" : 20,
"type" : "<ALPHANUM>",
"position" : 4
},
{
"token" : "doing",
"start_offset" : 21,
"end_offset" : 26,
"type" : "<ALPHANUM>",
"position" : 5
}
]
}
同义词
同义词定义规则
- a, b, c => d:这种方式,a、b、c 会被 d 代替。
- a, b, c, d:这种方式下,a、b、c、d 是等价的。
PUT test_token_filter_synonym
{
"settings": {
"analysis": {
"filter": {
"my_synonym": {
"type": "synonym",
"synonyms": [ "good, nice => excellent" ] //good, nice, excellent
}
}
}
}
}
GET test_token_filter_synonym/_analyze
{
"tokenizer": "standard",
"filter": ["my_synonym"],
"text": ["good"]
}
运行结果
{
"tokens" : [
{
"token" : "excellent",
"start_offset" : 0,
"end_offset" : 4,
"type" : "SYNONYM",
"position" : 0
}
]
}
字符过滤器:Character Filter
分词之前的预处理,过滤无用字符。
PUT <index_name>
{
"settings": {
"analysis": {
"char_filter": {
"my_char_filter": {
"type": "<char_filter_type>"
}
}
}
}
}
type:使用的字符过滤器类型名称,可配置以下值:
- html_strip
- mapping
- pattern_replace
HTML 标签过滤器:HTML Strip Character Filter
字符过滤器会去除 HTML 标签和转义 HTML 元素,如 、&
PUT test_html_strip_filter
{
"settings": {
"analysis": {
"char_filter": {
"my_char_filter": {
"type": "html_strip", // html_strip 代表使用 HTML 标签过滤器
"escaped_tags": [ // 当前仅保留 a 标签
"a"
]
}
}
}
}
}
GET test_html_strip_filter/_analyze
{
"tokenizer": "standard",
"char_filter": ["my_char_filter"],
"text": ["<p>I'm so <a>happy</a>!</p>"]
}
运行结果
{
"tokens" : [
{
"token" : "I'm",
"start_offset" : 3,
"end_offset" : 11,
"type" : "<ALPHANUM>",
"position" : 0
},
{
"token" : "so",
"start_offset" : 12,
"end_offset" : 14,
"type" : "<ALPHANUM>",
"position" : 1
},
{
"token" : "a",
"start_offset" : 16,
"end_offset" : 17,
"type" : "<ALPHANUM>",
"position" : 2
},
{
"token" : "happy",
"start_offset" : 18,
"end_offset" : 23,
"type" : "<ALPHANUM>",
"position" : 3
},
{
"token" : "a",
"start_offset" : 25,
"end_offset" : 26,
"type" : "<ALPHANUM>",
"position" : 4
}
]
}
参数:escaped_tags:需要保留的 html 标签
字符映射过滤器:Mapping Character Filter
通过定义映射替换为规则,把特定字符替换为指定字符
PUT test_html_strip_filter
{
"settings": {
"analysis": {
"char_filter": {
"my_char_filter": {
"type": "mapping", // mapping 代表使用字符映射过滤器
"mappings": [ // 数组中规定的字符会被等价替换为 => 指定的字符
"滚 => *",
"垃 => *",
"圾 => *"
]
}
}
}
}
}
GET test_html_strip_filter/_analyze
{
//"tokenizer": "standard",
"char_filter": ["my_char_filter"],
"text": "你就是个垃圾!滚"
}
运行结果
{
"tokens" : [
{
"token" : "你就是个**!*",
"start_offset" : 0,
"end_offset" : 8,
"type" : "word",
"position" : 0
}
]
}
正则替换过滤器:Pattern Replace Character Filter
PUT text_pattern_replace_filter
{
"settings": {
"analysis": {
"char_filter": {
"my_char_filter": {
"type": "pattern_replace", // pattern_replace 代表使用正则替换过滤器
"pattern": """(\d{3})\d{4}(\d{4})""", // 正则表达式
"replacement": "$1****$2"
}
}
}
}
}
GET text_pattern_replace_filter/_analyze
{
"char_filter": ["my_char_filter"],
"text": "您的手机号是18868686688"
}
运行结果
{
"tokens" : [
{
"token" : "您的手机号是188****6688",
"start_offset" : 0,
"end_offset" : 17,
"type" : "word",
"position" : 0
}
]
}
1.4 倒排索引的数据结构
当数据写入 ES 时,数据将会通过 分词 被切分为不同的 term,ES 将 term 与其对应的文档列表建立一种映射关系,这种结构就是 倒排索引。
如下图所示:
为了进一步提升索引的效率,ES 在 term 的基础上利用 term 的前缀或者后缀构建了 term index, 用于对 term 本身进行索引,ES 实际的索引结构如下图所示:
这样当我们去搜索某个关键词时,ES 首先根据它的前缀或者后缀迅速缩小关键词的在 term dictionary 中的范围,大大减少了磁盘IO的次数。
- 单词词典(Term Dictionary) :记录所有文档的单词,记录单词到倒排列表的关联关系
- 常用字典数据结构:lucene字典实现原理 - zhanlijun - 博客园
- 倒排列表(Posting List)-记录了单词对应的文档结合,由倒排索引项组成
- 倒排索引项(Posting):
- 文档ID
- 词频TF–该单词在文档中出现的次数,用于相关性评分
- 位置(Position)-单词在文档中分词的位置。用于短语搜索(match phrase query)
- 偏移(Offset)-记录单词的开始结束位置,实现高亮显示
Elasticsearch 的JSON文档中的每个字段,都有自己的倒排索引。
可以指定对某些字段不做索引:
- 优点︰节省存储空间
- 缺点: 字段无法被搜索
2. 相关性详解
搜索是用户和搜索引擎的对话,用户关心的是搜索结果的相关性
- 是否可以找到所有相关的内容
- 有多少不相关的内容被返回了
- 文档的打分是否合理
- 结合业务需求,平衡结果排名
2.1 什么是相关性(Relevance)
搜索的相关性算分,描述了一个文档和查询语句匹配的程度。ES 会对每个匹配查询条件的结果进行算分_score。打分的本质是排序,需要把最符合用户需求的文档排在前面。
如下例子:显而易见,查询JAVA多线程设计模式,文档id为2,3的文档的算分更高
关键词 |
文档ID |
JAVA |
1,2,3 |
设计模式 |
1,2,3,4,5,6 |
多线程 |
2,3,7,9 |
如何衡量相关性:
- Precision(查准率)―尽可能返回较少的无关文档
- Recall(查全率)–尽量返回较多的相关文档
- Ranking -是否能够按照相关度进行排序
2.2 相关性算法
ES 5之前,默认的相关性算分采用TF-IDF,现在采用BM 25。
TF-IDF
TF-IDF(term frequency–inverse document frequency)是一种用于信息检索与数据挖掘的常用加权技术。
- TF-IDF被公认为是信息检索领域最重要的发明,除了在信息检索,在文献分类和其他相关领域有着非常广泛的应用。
- IDF的概念,最早是剑桥大学的“斯巴克.琼斯”提出
-
- 1972年——“关键词特殊性的统计解释和它在文献检索中的应用”,但是没有从理论上解释IDF应该是用log(全部文档数/检索词出现过的文档总数),而不是其他函数,也没有做进一步的研究
- 1970,1980年代萨尔顿和罗宾逊,进行了进一步的证明和研究,并用香农信息论做了证明http://www.staff.city.ac.uk/~sb317/papers/foundations_bm25_review.pdf
- 现代搜索引擎,对TF-IDF进行了大量细微的优化
Lucene中的TF-IDF评分公式:
TF是词频(Term Frequency)
检索词在文档中出现的频率越高,相关性也越高。
词频(TF) = 某个词在文档中出现的次数 / 文档的总词数
IDF是逆向文本频率(Inverse Document Frequency)
每个检索词在索引中出现的频率,频率越高,相关性越低。总文档中有些词比如“是”、“的” 、“在” 在所有文档中出现频率都很高,并不重要,可以减少多个文档中都频繁出现的词的权重。
逆向文本频率(IDF)= log (语料库的文档总数 / (包含该词的文档数+1))
字段长度归一值( field-length norm)
检索词出现在一个内容短的 title 要比同样的词出现在一个内容长的 content 字段权重更大。
以上三个因素——词频(term frequency)、逆向文本频率(inverse document frequency)和字段长度归一值(field-length norm)——是在索引时计算并存储的,最后将它们结合在一起计算单个词在特定文档中的权重。
BM25
BM25 就是对 TF-IDF 算法的改进,对于 TF-IDF 算法,TF(t) 部分的值越大,整个公式返回的值就会越大。BM25 就针对这点进行来优化,随着TF(t) 的逐步加大,该算法的返回值会趋于一个数值。
- 从ES 5开始,默认算法改为BM 25
- 和经典的TF-IDF相比,当TF无限增加时,BM 25算分会趋于一个数值
- BM 25的公式
2.3 通过Explain API查看TF-IDF
PUT /test_score/_bulk
{"index":{"_id":1}}
{"content":"we use Elasticsearch to power the search"}
{"index":{"_id":2}}
{"content":"we like elasticsearch"}
{"index":{"_id":3}}
{"content":"Thre scoring of documents is caculated by the scoring formula"}
{"index":{"_id":4}}
{"content":"you know,for search"}
GET /test_score/_search
{
"explain": true,
"query": {
"match": {
"content": "elasticsearch"
}
}
}
GET /test_score/_explain/2
{
"query": {
"match": {
"content": "elasticsearch"
}
}
}
运行结果
{
"_index" : "test_score",
"_type" : "_doc",
"_id" : "2",
"matched" : true,
"explanation" : {
"value" : 0.8713851,
"description" : "weight(content:elasticsearch in 1) [PerFieldSimilarity], result of:",
"details" : [
{
"value" : 0.8713851,
"description" : "score(freq=1.0), computed as boost * idf * tf from:",
"details" : [
{
"value" : 2.2,
"description" : "boost",
"details" : [ ]
},
{
"value" : 0.6931472,
"description" : "idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:",
"details" : [
{
"value" : 2,
"description" : "n, number of documents containing term",
"details" : [ ]
},
{
"value" : 4,
"description" : "N, total number of documents with field",
"details" : [ ]
}
]
},
{
"value" : 0.5714286,
"description" : "tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:",
"details" : [
{
"value" : 1.0,
"description" : "freq, occurrences of term within document",
"details" : [ ]
},
{
"value" : 1.2,
"description" : "k1, term saturation parameter",
"details" : [ ]
},
{
"value" : 0.75,
"description" : "b, length normalization parameter",
"details" : [ ]
},
{
"value" : 3.0,
"description" : "dl, length of field",
"details" : [ ]
},
{
"value" : 6.0,
"description" : "avgdl, average length of field",
"details" : [ ]
}
]
}
]
}
]
}
}
GET /es_db/_explain/3
{
"query": {
"match": {
"address": "广州公园"
}
}
}
运行结果
{
"_index" : "es_db",
"_type" : "_doc",
"_id" : "3",
"matched" : true,
"explanation" : {
"value" : 1.6476591,
"description" : "sum of:",
"details" : [
{
"value" : 0.5978369,
"description" : "weight(address:广州 in 2) [PerFieldSimilarity], result of:",
"details" : [
{
"value" : 0.5978369,
"description" : "score(freq=1.0), computed as boost * idf * tf from:",
"details" : [
{
"value" : 2.2,
"description" : "boost",
"details" : [ ]
},
{
"value" : 0.597837,
"description" : "idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:",
"details" : [
{
"value" : 5,
"description" : "n, number of documents containing term",
"details" : [ ]
},
{
"value" : 9,
"description" : "N, total number of documents with field",
"details" : [ ]
}
]
},
{
"value" : 0.45454544,
"description" : "tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:",
"details" : [
{
"value" : 1.0,
"description" : "freq, occurrences of term within document",
"details" : [ ]
},
{
"value" : 1.2,
"description" : "k1, term saturation parameter",
"details" : [ ]
},
{
"value" : 0.75,
"description" : "b, length normalization parameter",
"details" : [ ]
},
{
"value" : 5.0,
"description" : "dl, length of field",
"details" : [ ]
},
{
"value" : 5.0,
"description" : "avgdl, average length of field",
"details" : [ ]
}
]
}
]
}
]
},
{
"value" : 1.0498221,
"description" : "weight(address:公园 in 2) [PerFieldSimilarity], result of:",
"details" : [
{
"value" : 1.0498221,
"description" : "score(freq=1.0), computed as boost * idf * tf from:",
"details" : [
{
"value" : 2.2,
"description" : "boost",
"details" : [ ]
},
{
"value" : 1.0498221,
"description" : "idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:",
"details" : [
{
"value" : 3,
"description" : "n, number of documents containing term",
"details" : [ ]
},
{
"value" : 9,
"description" : "N, total number of documents with field",
"details" : [ ]
}
]
},
{
"value" : 0.45454544,
"description" : "tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:",
"details" : [
{
"value" : 1.0,
"description" : "freq, occurrences of term within document",
"details" : [ ]
},
{
"value" : 1.2,
"description" : "k1, term saturation parameter",
"details" : [ ]
},
{
"value" : 0.75,
"description" : "b, length normalization parameter",
"details" : [ ]
},
{
"value" : 5.0,
"description" : "dl, length of field",
"details" : [ ]
},
{
"value" : 5.0,
"description" : "avgdl, average length of field",
"details" : [ ]
}
]
}
]
}
]
}
]
}
}
2.4 Boosting Query(常用)
Boosting是控制相关度的一种手段。可以通过指定字段的boost值影响查询结果
参数boost的含义:
- 当boost > 1时,打分的权重相对性提升
- 当0 < boost <1时,打分的权重相对性降低
- 当boost <0时,贡献负分
应用场景:希望包含了某项内容的结果不是不出现,而是排序靠后。
POST /blogs/_bulk
{"index":{"_id":1}}
{"title":"Apple iPad","content":"Apple iPad,Apple iPad"}
{"index":{"_id":2}}
{"title":"Apple iPad,Apple iPad","content":"Apple iPad"}
GET /blogs/_search
{
"query": {
"bool": {
"should": [
{
"match": {
"title": {
"query": "apple,ipad",
"boost": 1
}
}
},
{
"match": {
"content": {
"query": "apple,ipad",
"boost": 4
}
}
}
]
}
}
}
运行结果
{
"took" : 1,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 2,
"relation" : "eq"
},
"max_score" : 2.2558527,
"hits" : [
{
"_index" : "blogs",
"_type" : "_doc",
"_id" : "1",
"_score" : 2.2558527,
"_source" : {
"title" : "Apple iPad",
"content" : "Apple iPad,Apple iPad"
}
},
{
"_index" : "blogs",
"_type" : "_doc",
"_id" : "2",
"_score" : 2.1472821,
"_source" : {
"title" : "Apple iPad,Apple iPad",
"content" : "Apple iPad"
}
}
]
}
}
案例:要求苹果公司的产品信息优先展示
POST /news/_bulk
{"index":{"_id":1}}
{"content":"Apple Mac"}
{"index":{"_id":2}}
{"content":"Apple iPad"}
{"index":{"_id":3}}
{"content":"Apple employee like Apple Pie and Apple Juice"}
GET /news/_search
{
"query": {
"bool": {
"must": {
"match": {
"content": "apple"
}
}
}
}
}
运行结果
{
"took" : 1,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 3,
"relation" : "eq"
},
"max_score" : 0.17280531,
"hits" : [
{
"_index" : "news",
"_type" : "_doc",
"_id" : "3",
"_score" : 0.17280531,
"_source" : {
"content" : "Apple employee like Apple Pie and Apple Juice"
}
},
{
"_index" : "news",
"_type" : "_doc",
"_id" : "1",
"_score" : 0.16786805,
"_source" : {
"content" : "Apple Mac"
}
},
{
"_index" : "news",
"_type" : "_doc",
"_id" : "2",
"_score" : 0.16786805,
"_source" : {
"content" : "Apple iPad"
}
}
]
}
}
利用must not排除不是苹果公司产品的文档
GET /news/_search
{
"query": {
"bool": {
"must": {
"match": {
"content": "apple"
}
},
"must_not": {
"match":{
"content": "pie"
}
}
}
}
}
运行结果
{
"took" : 1,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 2,
"relation" : "eq"
},
"max_score" : 0.16786805,
"hits" : [
{
"_index" : "news",
"_type" : "_doc",
"_id" : "1",
"_score" : 0.16786805,
"_source" : {
"content" : "Apple Mac"
}
},
{
"_index" : "news",
"_type" : "_doc",
"_id" : "2",
"_score" : 0.16786805,
"_source" : {
"content" : &