Elasticsearch Analyzers, Relevance in Depth, and Aggregation Queries in Practice

GET _analyze
{
  "filter" : ["lowercase"],
  "text" : "WWW ELASTIC ORG CN"
}

GET _analyze
{
  "tokenizer" : "standard",
  "filter" : ["uppercase"],
  "text" : ["www.elastic.org.cn","www elastic org cn"]
}

Output

{
  "tokens" : [
    {
      "token" : "www elastic org cn",
      "start_offset" : 0,
      "end_offset" : 18,
      "type" : "word",
      "position" : 0
    }
  ]
}

{
  "tokens" : [
    {
      "token" : "WWW.ELASTIC.ORG.CN",
      "start_offset" : 0,
      "end_offset" : 18,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "WWW",
      "start_offset" : 19,
      "end_offset" : 22,
      "type" : "<ALPHANUM>",
      "position" : 101
    },
    {
      "token" : "ELASTIC",
      "start_offset" : 23,
      "end_offset" : 30,
      "type" : "<ALPHANUM>",
      "position" : 102
    },
    {
      "token" : "ORG",
      "start_offset" : 31,
      "end_offset" : 34,
      "type" : "<ALPHANUM>",
      "position" : 103
    },
    {
      "token" : "CN",
      "start_offset" : 35,
      "end_offset" : 37,
      "type" : "<ALPHANUM>",
      "position" : 104
    }
  ]
}

Stop words

Stop words are terms that are dropped after tokenization. The stop word list can be customized.

English stop words (english): a, an, and, are, as, at, be, but, by, for, if, in, into, is, it, no, not, of, on, or, such, that, the, their, then, there, these, they, this, to, was, will, with.

CJK stop words (cjk): a, and, are, as, at, be, but, by, for, if, in, into, is, it, no, not, of, on, or, s, such, t, that, the, their, then, there, these, they, this, to, was, will, with, www.

GET _analyze
{
  "tokenizer": "standard", 
  "filter": ["stop"],
  "text": ["What are you doing"]
}

### Custom filter
DELETE test_token_filter_stop
PUT test_token_filter_stop
{
  "settings": {
    "analysis": {
      "filter": {
        "my_filter": {
          "type": "stop",
          "stopwords": [
            "www"
          ],
          "ignore_case": true
        }
      }
    }
  }
}
GET test_token_filter_stop/_analyze
{
  "tokenizer": "standard", 
  "filter": ["my_filter"], 
  "text": ["What www WWW are you doing"]
}

Output

First GET
{
  "tokens" : [
    {
      "token" : "What",
      "start_offset" : 0,
      "end_offset" : 4,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "you",
      "start_offset" : 9,
      "end_offset" : 12,
      "type" : "<ALPHANUM>",
      "position" : 2
    },
    {
      "token" : "doing",
      "start_offset" : 13,
      "end_offset" : 18,
      "type" : "<ALPHANUM>",
      "position" : 3
    }
  ]
}

Second GET
{
  "tokens" : [
    {
      "token" : "What",
      "start_offset" : 0,
      "end_offset" : 4,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "are",
      "start_offset" : 13,
      "end_offset" : 16,
      "type" : "<ALPHANUM>",
      "position" : 3
    },
    {
      "token" : "you",
      "start_offset" : 17,
      "end_offset" : 20,
      "type" : "<ALPHANUM>",
      "position" : 4
    },
    {
      "token" : "doing",
      "start_offset" : 21,
      "end_offset" : 26,
      "type" : "<ALPHANUM>",
      "position" : 5
    }
  ]
}
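
The position gaps in the two outputs above (0, 2, 3 and 0, 3, 4, 5) come from the stop filter removing tokens without renumbering the survivors. The following is a minimal Python sketch of that behavior, not Elasticsearch's implementation: the function name `stop_filter` is hypothetical, and whitespace splitting stands in for the standard tokenizer.

```python
def stop_filter(text, stopwords, ignore_case=False):
    """Drop stop words but keep each surviving token's original position."""
    if ignore_case:
        stopwords = {word.lower() for word in stopwords}
    kept = []
    # Assumption: whitespace splitting stands in for the standard tokenizer.
    for position, token in enumerate(text.split()):
        key = token.lower() if ignore_case else token
        if key not in stopwords:
            kept.append((token, position))
    return kept


print(stop_filter("What www WWW are you doing", {"www"}, ignore_case=True))
# [('What', 0), ('are', 3), ('you', 4), ('doing', 5)]
```

With `ignore_case=True` both `www` and `WWW` are dropped, which matches the second result above.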

Synonyms

Synonym rule syntax

a, b, c => d: a, b, and c are each replaced by d.
a, b, c, d: a, b, c, and d are treated as equivalent.

PUT test_token_filter_synonym
{
  "settings": {
    "analysis": {
      "filter": {
        "my_synonym": {
          "type": "synonym",
          "synonyms": [ "good, nice => excellent" ] // alternative using the equivalent form: "good, nice, excellent"
        }
      }
    }
  }
}
GET test_token_filter_synonym/_analyze
{
  "tokenizer": "standard", 
  "filter": ["my_synonym"], 
  "text": ["good"]
}

Output

{
  "tokens" : [
    {
      "token" : "excellent",
      "start_offset" : 0,
      "end_offset" : 4,
      "type" : "SYNONYM",
      "position" : 0
    }
  ]
}
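
To make the two rule forms concrete, here is a small Python sketch of how such rules could be interpreted. It is only an illustration under simplifying assumptions: the helper names `parse_synonym_rule` and `apply_synonyms` are hypothetical, only single-word terms are handled, and token positions are ignored.

```python
def parse_synonym_rule(rule):
    """Turn one rule string into a dict mapping each input term to its output terms."""
    if "=>" in rule:
        left, right = rule.split("=>")
        sources = [term.strip() for term in left.split(",")]
        targets = [term.strip() for term in right.split(",")]
    else:
        # Equivalent form: every term expands to the whole group.
        sources = targets = [term.strip() for term in rule.split(",")]
    return {source: targets for source in sources}


def apply_synonyms(tokens, rule):
    """Replace or expand each token according to a single synonym rule."""
    mapping = parse_synonym_rule(rule)
    result = []
    for token in tokens:
        result.extend(mapping.get(token, [token]))
    return result


print(apply_synonyms(["good"], "good, nice => excellent"))  # ['excellent']
print(apply_synonyms(["good"], "good, nice, excellent"))    # ['good', 'nice', 'excellent']
```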

Character Filter

Character filters preprocess the text before tokenization, removing or replacing unwanted characters.

PUT <index_name>
{
  "settings": {
    "analysis": {
      "char_filter": {
        "my_char_filter": {
          "type": "<char_filter_type>"
        }
      }
    }
  }
}

type: the name of the character filter type to use. Supported values:

html_strip
mapping
pattern_replace

HTML Strip Character Filter

This character filter strips HTML tags and decodes HTML entities such as &apos; and &amp;.

PUT test_html_strip_filter
{
  "settings": {
    "analysis": {
      "char_filter": {
        "my_char_filter": {
          "type": "html_strip",  // use the HTML strip character filter
          "escaped_tags": [      // tags to keep; here only the <a> tag is preserved
            "a"
          ]
        }
      }
    }
  }
}
GET test_html_strip_filter/_analyze
{
  "tokenizer": "standard", 
  "char_filter": ["my_char_filter"],
  "text": ["<p>I&apos;m so <a>happy</a>!</p>"]
}

Output

{
  "tokens" : [
    {
      "token" : "I'm",
      "start_offset" : 3,
      "end_offset" : 11,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "so",
      "start_offset" : 12,
      "end_offset" : 14,
      "type" : "<ALPHANUM>",
      "position" : 1
    },
    {
      "token" : "a",
      "start_offset" : 16,
      "end_offset" : 17,
      "type" : "<ALPHANUM>",
      "position" : 2
    },
    {
      "token" : "happy",
      "start_offset" : 18,
      "end_offset" : 23,
      "type" : "<ALPHANUM>",
      "position" : 3
    },
    {
      "token" : "a",
      "start_offset" : 25,
      "end_offset" : 26,
      "type" : "<ALPHANUM>",
      "position" : 4
    }
  ]
}

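Parameter: escaped_tags: the HTML tags to keep. Because <a> is preserved here, the standard tokenizer emits a as a token in the output above.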
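
The effect of the filter on the raw text, before the tokenizer sees it, can be approximated in Python. This is a rough sketch rather than the actual filter: the function name `html_strip` is hypothetical, the tag pattern is a simplification, and the real filter handles details this ignores (for example, how line breaks are produced for some tags).

```python
import html
import re


def html_strip(text, escaped_tags=()):
    """Remove HTML tags except those in escaped_tags, then decode entities."""
    def drop_or_keep(match):
        # Keep the tag verbatim if its name is listed in escaped_tags.
        return match.group(0) if match.group(1).lower() in escaped_tags else ""

    without_tags = re.sub(r"</?([A-Za-z][A-Za-z0-9]*)[^>]*>", drop_or_keep, text)
    return html.unescape(without_tags)


print(html_strip("<p>I&apos;m so <a>happy</a>!</p>", escaped_tags=("a",)))
# I'm so <a>happy</a>!
```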

Mapping Character Filter

Replaces specific characters with specified replacements, according to user-defined mapping rules.

PUT test_mapping_filter
{
  "settings": {
    "analysis": {
      "char_filter": {
        "my_char_filter": {
          "type": "mapping",    // use the mapping character filter
          "mappings": [         // each listed character is replaced by the character after =>
            "滚 => *",
            "垃 => *",
            "圾 => *"
          ]
        }
      }
    }
  }
}
GET test_mapping_filter/_analyze
{
  //"tokenizer": "standard", 
  "char_filter": ["my_char_filter"],
  "text": "你就是个垃圾！滚"
}

Output

{
  "tokens" : [
    {
      "token" : "你就是个**！*",
      "start_offset" : 0,
      "end_offset" : 8,
      "type" : "word",
      "position" : 0
    }
  ]
}
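
The same character substitution can be sketched in a few lines of Python. The function name `mapping_char_filter` is hypothetical, and the sketch assumes single-character keys, which is all the example above uses.

```python
def mapping_char_filter(text, mappings):
    """Apply 'key => value' rules to the raw text before tokenization."""
    table = {}
    for rule in mappings:
        key, value = (part.strip() for part in rule.split("=>"))
        table[key] = value
    # Assumption: every key is a single character, so str.translate is enough.
    return text.translate(str.maketrans(table))


print(mapping_char_filter("你就是个垃圾！滚", ["滚 => *", "垃 => *", "圾 => *"]))
# 你就是个**！*
```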

Pattern Replace Character Filter

PUT text_pattern_replace_filter
{
  "settings": {
    "analysis": {
      "char_filter": {
        "my_char_filter": {
          "type": "pattern_replace",    // use the pattern replace character filter
          "pattern": """(\d{3})\d{4}(\d{4})""",    // regular expression: 3 digits kept, 4 digits masked, 4 digits kept
          "replacement": "$1****$2"
        }
      }
    }
  }
}
GET text_pattern_replace_filter/_analyze
{
  "char_filter": ["my_char_filter"],
  "text": "您的手机号是18868686688"
}

Output

{
  "tokens" : [
    {
      "token" : "您的手机号是188****6688",
      "start_offset" : 0,
      "end_offset" : 17,
      "type" : "word",
      "position" : 0
    }
  ]
}
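
The regular expression itself can be tried outside Elasticsearch. The sketch below uses Python's `re` module with the same pattern; the function name `mask_phone` is hypothetical, and note that Python writes group references as `\1` and `\2` where the filter's replacement string uses `$1` and `$2`.

```python
import re


def mask_phone(text):
    """Mask the middle four digits of an 11-digit number, keeping 3 + 4 digits."""
    return re.sub(r"(\d{3})\d{4}(\d{4})", r"\1****\2", text)


print(mask_phone("您的手机号是18868686688"))
# 您的手机号是188****6688
```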

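1.4 Data structure of the inverted index

When data is written to ES, it is split into terms by the analyzer. ES maps each term to the list of documents that contain it; this structure is the inverted index.

As shown in the figure below:

To make lookups more efficient, ES builds a term index on top of the terms, using term prefixes or suffixes, to index the terms themselves. The actual index structure in ES is shown in the figure below:

When we search for a keyword, ES first uses its prefix or suffix to quickly narrow down the keyword's range in the term dictionary, which greatly reduces the number of disk I/O operations.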
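
The idea of narrowing a range in a sorted term dictionary by prefix can be illustrated with a binary search. This is only an analogy: Lucene's term index is a more compact structure (a finite state transducer), and the function name `prefix_range` and the sample terms are made up for the example.

```python
from bisect import bisect_left


def prefix_range(sorted_terms, prefix):
    """Return the slice bounds of all terms starting with prefix in a sorted list."""
    low = bisect_left(sorted_terms, prefix)
    # Assumption: "\uffff" sorts after every character used in the terms,
    # so it marks the end of the block of terms sharing the prefix.
    high = bisect_left(sorted_terms, prefix + "\uffff")
    return low, high


terms = sorted(["apple", "elastic", "elasticsearch", "engine", "java", "search"])
low, high = prefix_range(terms, "elas")
print(terms[low:high])  # ['elastic', 'elasticsearch']
```

Only the terms between `low` and `high` need to be examined, instead of the whole dictionary.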

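Term Dictionary: records the terms of all documents, and the association from each term to its posting list
- Common dictionary data structures: "lucene字典实现原理" - zhanlijun - 博客园 (cnblogs)

Posting List: records the set of documents that contain a term; it is made up of postings
Posting:
- Document ID
- Term frequency (TF): the number of times the term occurs in the document, used for relevance scoring
- Position: the position of the term among the document's tokens, used for phrase search (match phrase query)
- Offset: the start and end character offsets of the term, used for highlighting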
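
The posting fields listed above can be made concrete with a toy inverted index. This is a simplified sketch, not Lucene's on-disk format: the function name `build_inverted_index` and the sample documents are hypothetical, and lowercased `\w+` runs stand in for a real analyzer.

```python
import re
from collections import defaultdict


def build_inverted_index(docs):
    """Map each term to postings of (doc_id, term_frequency, positions, offsets)."""
    index = defaultdict(list)
    for doc_id, text in docs.items():
        per_term = defaultdict(lambda: {"positions": [], "offsets": []})
        # Assumption: lowercased \w+ runs stand in for a real analyzer.
        for position, match in enumerate(re.finditer(r"\w+", text.lower())):
            entry = per_term[match.group()]
            entry["positions"].append(position)
            entry["offsets"].append((match.start(), match.end()))
        for term, entry in per_term.items():
            index[term].append(
                (doc_id, len(entry["positions"]), entry["positions"], entry["offsets"])
            )
    return index


docs = {1: "we like elasticsearch", 2: "elasticsearch powers search, search is fun"}
index = build_inverted_index(docs)
print(index["search"])
# [(2, 2, [2, 3], [(21, 27), (29, 35)])]
```

Each posting carries the document ID, the term frequency, the token positions (for phrase queries), and the character offsets (for highlighting).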

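Every field of a JSON document in Elasticsearch has its own inverted index.

You can choose not to index certain fields:

Pro: saves storage space
Con: the field cannot be searched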

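2. Relevance in depth

Search is a conversation between the user and the search engine, and what the user cares about is the relevance of the results:

Can all relevant content be found?
How much irrelevant content is returned?
Are the documents scored sensibly?
Is the ranking balanced against business requirements?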

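2.1 What is relevance?

The relevance score describes how well a document matches a query. ES computes a score (_score) for every result that matches the query. Scoring is essentially about ranking: the documents that best fit the user's need should come first.

For example, for the query "JAVA multithreading design patterns", the documents with IDs 2 and 3 clearly score higher, because they are the only ones that contain all three keywords:

Keyword	Document IDs
JAVA	1, 2, 3
design patterns	1, 2, 3, 4, 5, 6
multithreading	2, 3, 7, 9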
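
The reasoning behind the example can be checked by counting, for each document, how many of the three keywords it contains. The sketch below is a deliberately crude stand-in for real scoring; the keyword labels are English renderings of the three keywords in the table, and the function name `matched_term_counts` is hypothetical.

```python
postings = {
    "JAVA": {1, 2, 3},
    "design patterns": {1, 2, 3, 4, 5, 6},
    "multithreading": {2, 3, 7, 9},
}


def matched_term_counts(postings):
    """Count how many query keywords each document contains."""
    counts = {}
    for doc_ids in postings.values():
        for doc_id in doc_ids:
            counts[doc_id] = counts.get(doc_id, 0) + 1
    return counts


counts = matched_term_counts(postings)
best = sorted(doc_id for doc_id, n in counts.items() if n == len(postings))
print(best)  # [2, 3]
```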

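How to measure relevance:

Precision: return as few irrelevant documents as possible
Recall: return as many relevant documents as possible
Ranking: whether results can be ordered by relevance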
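
Precision and recall have simple set-based definitions, which the following sketch computes. The document IDs are invented for illustration and the function name `precision_recall` is hypothetical.

```python
def precision_recall(retrieved, relevant):
    """Precision = hits / retrieved; recall = hits / relevant."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# 4 documents returned, 4 truly relevant, 2 in common.
print(precision_recall(retrieved=[1, 2, 3, 4], relevant=[2, 3, 7, 9]))  # (0.5, 0.5)
```

Returning more documents tends to raise recall and lower precision, which is why the two are usually considered together.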

2.2 Relevance algorithms

Before ES 5, the default relevance scoring algorithm was TF-IDF; since ES 5 it is BM25.

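TF-IDF

TF-IDF (term frequency-inverse document frequency) is a weighting technique commonly used in information retrieval and data mining.

TF-IDF is widely regarded as one of the most important inventions in information retrieval; beyond retrieval, it is also widely used in document classification and related fields.
The concept of IDF was first proposed by Karen Spärck Jones of the University of Cambridge.

- 1972: "A statistical interpretation of term specificity and its application in retrieval". The paper did not explain theoretically why IDF should be log(total number of documents / number of documents containing the term) rather than some other function, and did not pursue the question further.
- 1970s and 1980s: Salton and Robertson studied it further and gave a justification based on Shannon's information theory: http://www.staff.city.ac.uk/~sb317/papers/foundations_bm25_review.pdf

Modern search engines have made many fine-grained optimizations to TF-IDF.

The TF-IDF scoring formula in Lucene: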
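
For reference, Lucene's classic practical scoring function (as documented for its TF-IDF similarity) is usually written as follows. Treat this as a reference sketch: the symbol names follow the Lucene documentation rather than anything defined in this article.

```latex
\mathrm{score}(q, d) = \mathrm{queryNorm}(q) \cdot \mathrm{coord}(q, d) \cdot
  \sum_{t \in q} \Big( \mathrm{tf}(t, d) \cdot \mathrm{idf}(t)^2 \cdot \mathrm{boost}(t) \cdot \mathrm{norm}(t, d) \Big)
```

Here q is the query, d the document, and t ranges over the query terms; norm(t, d) carries the field-length normalization.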

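TF is the term frequency

The more often a search term appears in a document, the more relevant the document is.

Term frequency (TF) = number of occurrences of the term in the document / total number of terms in the document

IDF is the inverse document frequency

This reflects how often a search term appears across the index: the more documents contain it, the less it contributes to relevance. Words such as "是", "的", and "在" (roughly "is", "of", and "at") appear in almost every document and carry little meaning, so terms that occur frequently across many documents are given a lower weight.

Inverse document frequency (IDF) = log(total number of documents in the corpus / (number of documents containing the term + 1))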
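
The two formulas above translate directly into code. The sketch below assumes the natural logarithm (the formula does not name a base) and uses a small made-up corpus of pre-tokenized documents; the function names are hypothetical.

```python
import math


def term_frequency(term, doc_tokens):
    """Occurrences of the term divided by the total number of terms in the document."""
    return doc_tokens.count(term) / len(doc_tokens)


def inverse_document_frequency(term, corpus):
    """log(total documents / (documents containing the term + 1)), natural log assumed."""
    containing = sum(1 for doc_tokens in corpus if term in doc_tokens)
    return math.log(len(corpus) / (containing + 1))


corpus = [
    ["we", "like", "elasticsearch"],
    ["you", "know", "for", "search"],
    ["we", "use", "elasticsearch", "to", "power", "the", "search"],
    ["the", "scoring", "formula"],
]
tf = term_frequency("elasticsearch", corpus[0])
idf = inverse_document_frequency("elasticsearch", corpus)
print(tf, idf, tf * idf)
```

In this corpus the term appears once among three tokens of the first document (TF = 1/3) and in two of four documents (IDF = log(4/3)).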

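Field-length norm

A search term that appears in a short field such as title carries more weight than the same term appearing in a long field such as content.

These three factors (term frequency, inverse document frequency, and field-length norm) are calculated and stored at index time, and are finally combined to compute the weight of a single term in a particular document.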

BM25

BM25 is an improvement on TF-IDF. In TF-IDF, the larger TF(t) is, the larger the overall score, without bound. BM25 addresses this: as TF(t) keeps growing, the score levels off and approaches a limiting value.

BM25 has been the default algorithm since ES 5.
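
The saturation effect is easy to see numerically. The sketch below assumes the term-frequency component in the form that the Explain API output reports later in this article, freq / (freq + k1 * (1 - b + b * dl / avgdl)), with the default k1 = 1.2 and b = 0.75; the function name `bm25_tf` and the field lengths are made up.

```python
def bm25_tf(freq, dl, avgdl, k1=1.2, b=0.75):
    """BM25 term-frequency component; it grows with freq but stays below 1."""
    return freq / (freq + k1 * (1 - b + b * dl / avgdl))


for freq in (1, 2, 5, 10, 100, 1000):
    print(freq, round(bm25_tf(freq, dl=10, avgdl=10), 4))
```

Going from 1 to 2 occurrences raises the value noticeably, while going from 100 to 1000 barely changes it.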

The BM25 formula:
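
The form below is the per-term score as Elasticsearch 7.x reports it in the Explain API output shown later in this article, written out in LaTeX. Other presentations of BM25 arrange the constants differently, so read it as one concrete variant rather than the only definition.

```latex
\mathrm{score}(t, d) = \mathrm{boost} \cdot \mathrm{idf}(t) \cdot \mathrm{tf}(t, d)

\mathrm{idf}(t) = \ln\!\left(1 + \frac{N - n + 0.5}{n + 0.5}\right)

\mathrm{tf}(t, d) = \frac{\mathrm{freq}}{\mathrm{freq} + k_1 \left(1 - b + b \cdot \dfrac{dl}{avgdl}\right)}
```

N is the number of documents that have the field, n the number of documents containing the term, freq the number of occurrences of the term in the document, dl the field length, avgdl the average field length, k1 the term saturation parameter, and b the length normalization parameter. For a query with several terms, the per-term scores are summed.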

2.3 Inspecting TF and IDF with the Explain API

PUT /test_score/_bulk
{"index":{"_id":1}}
{"content":"we use Elasticsearch to power the search"}
{"index":{"_id":2}}
{"content":"we like elasticsearch"}
{"index":{"_id":3}}
{"content":"The scoring of documents is calculated by the scoring formula"}
{"index":{"_id":4}}
{"content":"you know,for search"}

GET /test_score/_search
{
  "explain": true, 
  "query": {
    "match": {
      "content": "elasticsearch"
    }
  }
}

GET /test_score/_explain/2
{
  "query": {
    "match": {
      "content": "elasticsearch"
    }
  }
}

Output

{
  "_index" : "test_score",
  "_type" : "_doc",
  "_id" : "2",
  "matched" : true,
  "explanation" : {
    "value" : 0.8713851,
    "description" : "weight(content:elasticsearch in 1) [PerFieldSimilarity], result of:",
    "details" : [
      {
        "value" : 0.8713851,
        "description" : "score(freq=1.0), computed as boost * idf * tf from:",
        "details" : [
          {
            "value" : 2.2,
            "description" : "boost",
            "details" : [ ]
          },
          {
            "value" : 0.6931472,
            "description" : "idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:",
            "details" : [
              {
                "value" : 2,
                "description" : "n, number of documents containing term",
                "details" : [ ]
              },
              {
                "value" : 4,
                "description" : "N, total number of documents with field",
                "details" : [ ]
              }
            ]
          },
          {
            "value" : 0.5714286,
            "description" : "tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:",
            "details" : [
              {
                "value" : 1.0,
                "description" : "freq, occurrences of term within document",
                "details" : [ ]
              },
              {
                "value" : 1.2,
                "description" : "k1, term saturation parameter",
                "details" : [ ]
              },
              {
                "value" : 0.75,
                "description" : "b, length normalization parameter",
                "details" : [ ]
              },
              {
                "value" : 3.0,
                "description" : "dl, length of field",
                "details" : [ ]
              },
              {
                "value" : 6.0,
                "description" : "avgdl, average length of field",
                "details" : [ ]
              }
            ]
          }
        ]
      }
    ]
  }
}
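
The numbers in this explanation can be reproduced by hand. The following Python sketch plugs the reported values (n = 2, N = 4, freq = 1, dl = 3, avgdl = 6, k1 = 1.2, b = 0.75, boost = 2.2) into the formulas quoted in the description fields; the function name `bm25_term_score` is hypothetical.

```python
import math


def bm25_term_score(freq, n, N, dl, avgdl, boost=2.2, k1=1.2, b=0.75):
    """boost * idf * tf, using the formulas quoted in the Explain output."""
    idf = math.log(1 + (N - n + 0.5) / (n + 0.5))
    tf = freq / (freq + k1 * (1 - b + b * dl / avgdl))
    return boost * idf * tf


score = bm25_term_score(freq=1.0, n=2, N=4, dl=3.0, avgdl=6.0)
print(round(score, 6))  # 0.871385
```

The result agrees with the reported 0.8713851 up to single-precision rounding.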

GET /es_db/_explain/3
{
  "query": {
    "match": {
      "address": "广州公园"
    }
  }
}

Output

{
  "_index" : "es_db",
  "_type" : "_doc",
  "_id" : "3",
  "matched" : true,
  "explanation" : {
    "value" : 1.6476591,
    "description" : "sum of:",
    "details" : [
      {
        "value" : 0.5978369,
        "description" : "weight(address:广州 in 2) [PerFieldSimilarity], result of:",
        "details" : [
          {
            "value" : 0.5978369,
            "description" : "score(freq=1.0), computed as boost * idf * tf from:",
            "details" : [
              {
                "value" : 2.2,
                "description" : "boost",
                "details" : [ ]
              },
              {
                "value" : 0.597837,
                "description" : "idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:",
                "details" : [
                  {
                    "value" : 5,
                    "description" : "n, number of documents containing term",
                    "details" : [ ]
                  },
                  {
                    "value" : 9,
                    "description" : "N, total number of documents with field",
                    "details" : [ ]
                  }
                ]
              },
              {
                "value" : 0.45454544,
                "description" : "tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:",
                "details" : [
                  {
                    "value" : 1.0,
                    "description" : "freq, occurrences of term within document",
                    "details" : [ ]
                  },
                  {
                    "value" : 1.2,
                    "description" : "k1, term saturation parameter",
                    "details" : [ ]
                  },
                  {
                    "value" : 0.75,
                    "description" : "b, length normalization parameter",
                    "details" : [ ]
                  },
                  {
                    "value" : 5.0,
                    "description" : "dl, length of field",
                    "details" : [ ]
                  },
                  {
                    "value" : 5.0,
                    "description" : "avgdl, average length of field",
                    "details" : [ ]
                  }
                ]
              }
            ]
          }
        ]
      },
      {
        "value" : 1.0498221,
        "description" : "weight(address:公园 in 2) [PerFieldSimilarity], result of:",
        "details" : [
          {
            "value" : 1.0498221,
            "description" : "score(freq=1.0), computed as boost * idf * tf from:",
            "details" : [
              {
                "value" : 2.2,
                "description" : "boost",
                "details" : [ ]
              },
              {
                "value" : 1.0498221,
                "description" : "idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:",
                "details" : [
                  {
                    "value" : 3,
                    "description" : "n, number of documents containing term",
                    "details" : [ ]
                  },
                  {
                    "value" : 9,
                    "description" : "N, total number of documents with field",
                    "details" : [ ]
                  }
                ]
              },
              {
                "value" : 0.45454544,
                "description" : "tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:",
                "details" : [
                  {
                    "value" : 1.0,
                    "description" : "freq, occurrences of term within document",
                    "details" : [ ]
                  },
                  {
                    "value" : 1.2,
                    "description" : "k1, term saturation parameter",
                    "details" : [ ]
                  },
                  {
                    "value" : 0.75,
                    "description" : "b, length normalization parameter",
                    "details" : [ ]
                  },
                  {
                    "value" : 5.0,
                    "description" : "dl, length of field",
                    "details" : [ ]
                  },
                  {
                    "value" : 5.0,
                    "description" : "avgdl, average length of field",
                    "details" : [ ]
                  }
                ]
              }
            ]
          }
        ]
      }
    ]
  }
}

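2.4 Boosting Query (commonly used)

Boosting is a way to control relevance. You can influence the query results by specifying a boost value for a field.

Meaning of the boost parameter:

When boost > 1, the relative weight of the score is raised
When 0 < boost < 1, the relative weight of the score is lowered
When boost < 0, the clause contributes a negative score (Elasticsearch 7.0 and later reject negative boost values, so use a value between 0 and 1 to demote instead)

Use case: results that contain certain content should not be excluded, but should rank lower.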

POST /blogs/_bulk
{"index":{"_id":1}}
{"title":"Apple iPad","content":"Apple iPad,Apple iPad"}
{"index":{"_id":2}}
{"title":"Apple iPad,Apple iPad","content":"Apple iPad"}

GET /blogs/_search
{
  "query": {
    "bool": {
      "should": [
        {
          "match": {
            "title": {
              "query": "apple,ipad",
              "boost": 1
            }
          }
        },
        {
          "match": {
            "content": {
              "query": "apple,ipad",
              "boost": 4
            }
          }
        }
      ]
    }
  }
}

Output

{
  "took" : 1,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 2,
      "relation" : "eq"
    },
    "max_score" : 2.2558527,
    "hits" : [
      {
        "_index" : "blogs",
        "_type" : "_doc",
        "_id" : "1",
        "_score" : 2.2558527,
        "_source" : {
          "title" : "Apple iPad",
          "content" : "Apple iPad,Apple iPad"
        }
      },
      {
        "_index" : "blogs",
        "_type" : "_doc",
        "_id" : "2",
        "_score" : 2.1472821,
        "_source" : {
          "title" : "Apple iPad,Apple iPad",
          "content" : "Apple iPad"
        }
      }
    ]
  }
}
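
The two scores above can be reconstructed from the BM25 formula, which shows how the boost values enter the result. The sketch below rests on stated assumptions: a single shard with only these two documents, the standard analyzer splitting "Apple iPad,Apple iPad" into four tokens, and the query boost multiplying the built-in 2.2 factor. The function names are hypothetical.

```python
import math


def bm25_term_score(freq, n, N, dl, avgdl, boost=2.2, k1=1.2, b=0.75):
    """boost * idf * tf, in the form reported by the Explain API."""
    idf = math.log(1 + (N - n + 0.5) / (n + 0.5))
    tf = freq / (freq + k1 * (1 - b + b * dl / avgdl))
    return boost * idf * tf


def field_score(freq, dl, query_boost):
    # Both query terms (apple, ipad) occur in both documents with the same
    # frequency inside a given field, so the field score is twice one term's score.
    # Field lengths are 2 and 4 tokens, hence avgdl = 3 for title and for content.
    return 2 * bm25_term_score(freq, n=2, N=2, dl=dl, avgdl=3.0, boost=2.2 * query_boost)


# title boost = 1, content boost = 4, as in the query above.
doc1 = field_score(freq=1, dl=2, query_boost=1) + field_score(freq=2, dl=4, query_boost=4)
doc2 = field_score(freq=2, dl=4, query_boost=1) + field_score(freq=1, dl=2, query_boost=4)
print(round(doc1, 5), round(doc2, 5))
```

Document 1 comes out ahead because its stronger match is in content, the field with the larger boost; swapping the two boost values would reverse the order.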

Example: product information from Apple (the company) should be shown first

POST /news/_bulk
{"index":{"_id":1}}
{"content":"Apple Mac"}
{"index":{"_id":2}}
{"content":"Apple iPad"}
{"index":{"_id":3}}
{"content":"Apple employee like Apple Pie and Apple Juice"}


GET /news/_search
{
  "query": {
    "bool": {
      "must": {
        "match": {
          "content": "apple"
        }
      }
    }
  }
}

Output

{
  "took" : 1,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 3,
      "relation" : "eq"
    },
    "max_score" : 0.17280531,
    "hits" : [
      {
        "_index" : "news",
        "_type" : "_doc",
        "_id" : "3",
        "_score" : 0.17280531,
        "_source" : {
          "content" : "Apple employee like Apple Pie and Apple Juice"
        }
      },
      {
        "_index" : "news",
        "_type" : "_doc",
        "_id" : "1",
        "_score" : 0.16786805,
        "_source" : {
          "content" : "Apple Mac"
        }
      },
      {
        "_index" : "news",
        "_type" : "_doc",
        "_id" : "2",
        "_score" : 0.16786805,
        "_source" : {
          "content" : "Apple iPad"
        }
      }
    ]
  }
}
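
Document 3 is not about an Apple product, yet it ranks first. A short calculation shows why: it mentions apple three times, and that outweighs the penalty for being the longest document. The sketch assumes a single shard holding only these three documents and uses lowercased whitespace splitting in place of the standard analyzer; the function name `bm25_term_score` is hypothetical.

```python
import math


def bm25_term_score(freq, n, N, dl, avgdl, boost=2.2, k1=1.2, b=0.75):
    """boost * idf * tf, in the form reported by the Explain API."""
    idf = math.log(1 + (N - n + 0.5) / (n + 0.5))
    tf = freq / (freq + k1 * (1 - b + b * dl / avgdl))
    return boost * idf * tf


docs = {
    1: "Apple Mac",
    2: "Apple iPad",
    3: "Apple employee like Apple Pie and Apple Juice",
}
# Assumption: lowercased whitespace splitting stands in for the standard analyzer.
tokens = {doc_id: text.lower().split() for doc_id, text in docs.items()}
avgdl = sum(len(t) for t in tokens.values()) / len(tokens)
n = sum(1 for t in tokens.values() if "apple" in t)

scores = {
    doc_id: bm25_term_score(t.count("apple"), n=n, N=len(tokens), dl=len(t), avgdl=avgdl)
    for doc_id, t in tokens.items()
}
for doc_id, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(doc_id, round(score, 6))
```

This reproduces the ordering above, which is why term matching alone is not enough to put the product documents first.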

Use must_not to exclude documents that are not Apple products

GET /news/_search
{
  "query": {
    "bool": {
      "must": {
        "match": {
          "content": "apple"
        }
      },
      "must_not": {
        "match":{
          "content": "pie"
        }
      }
    }
  }
}

Output

{
  "took" : 1,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 2,
      "relation" : "eq"
    },
    "max_score" : 0.16786805,
    "hits" : [
      {
        "_index" : "news",
        "_type" : "_doc",
        "_id" : "1",
        "_score" : 0.16786805,
        "_source" : {
          "content" : "Apple Mac"
        }
      },
      {
        "_index" : "news",
        "_type" : "_doc",
        "_id" : "2",
        "_score" : 0.16786805,
        "_source" : {
          "content" : &