Elasticsearch (12): Query String Analysis, Changing Analyzer Settings, and Custom Analyzers

Author: 夏目

Category: Elasticsearch. Tags: elasticsearch, kibana

Article link: https://blog.csdn.net/wuzhiwei549/article/details/80395015

Query string analysis

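  The query string must be analyzed with the same analyzer that was used when the index was built.

  The query string is handled differently for exact value fields and full text fields (covered in detail in section 10).

  date: exact value

  _all: full text

  Suppose we have a document with one field whose value is "hello you and me", and an inverted index is built from it.

  We then search the index that holds this document with the search text "hello me". That search text is the query string.

  By default, es analyzes the query string with the same analyzer that built the inverted index for the target field, applying both tokenization and normalization. Only then can the search match correctly.

  If indexing turned dogs into dog but the search still used dogs, nothing would be found. The dogs in the query must also become dog before it can match.

  Key point: fields of different types behave differently. Some are full text, some are exact value.

  post_date, date: exact value

  _all: full text, with tokenization and normalization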
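
The dogs --> dog point can be illustrated with a small Python sketch. This is a toy model, not es itself: the `analyze` function, its crude plural stripping, and the second document are assumptions made up for illustration. It shows that a raw query term misses the index, while the same term passed through the index-time analyzer matches.

```python
import re

def analyze(text):
    """Toy analyzer: lowercase, split on non-letters, strip a trailing plural 's'."""
    tokens = re.findall(r"[a-z]+", text.lower())
    # Crude normalization standing in for a real stemmer: dogs -> dog
    return [t[:-1] if len(t) > 3 and t.endswith("s") else t for t in tokens]

def build_index(docs):
    """Map each analyzed term to the set of document ids that contain it."""
    index = {}
    for doc_id, text in docs.items():
        for term in analyze(text):
            index.setdefault(term, set()).add(doc_id)
    return index

def search(index, query, analyzed=True):
    """Return the ids of documents matching any query term (OR semantics)."""
    terms = analyze(query) if analyzed else query.split()
    hits = set()
    for term in terms:
        hits |= index.get(term, set())
    return hits

# doc1 comes from the text above; doc2 is a hypothetical second document
docs = {"doc1": "hello you and me", "doc2": "I like dogs"}
index = build_index(docs)

print(search(index, "dogs", analyzed=False))  # raw term "dogs" is not in the index
print(search(index, "dogs"))                  # analyzed to "dog", which is
print(search(index, "hello me"))
```

The only difference between the first two searches is whether the query goes through `analyze`, which is exactly the condition the section describes.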

How the analyzer is applied

  GET /_search?q=2017 

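  This searches the _all field. All fields of a document are concatenated into one long string, which is then analyzed. For example:

  2017-01-02 my second article this is my second article in this website 11400

  The resulting inverted index looks like this (01 appears in every document because it is also the month):

  term   doc1   doc2   doc3
  2017   *      *      *
  01     *      *      *
  02            *
  03                   *

  Searching _all for 2017 therefore matches all 3 documents.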

  GET /_search?q=2017-01-01 

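  This also searches the _all field. The query string 2017-01-01 is analyzed with the same analyzer that built the inverted index, which splits it into three terms:

  2017
  01
  01

  A document matches if it contains any of these terms, so all 3 documents are returned again.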
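
To make the _all behavior concrete, here is a minimal Python sketch. It is an approximation under stated assumptions: `tokenize` is a rough stand-in for the standard tokenizer, and the three shortened `_all` strings are hypothetical values, not the real documents.

```python
import re

def tokenize(text):
    """Rough stand-in for the standard tokenizer: lowercase runs of letters or digits."""
    return re.findall(r"[0-9a-z]+", text.lower())

# Hypothetical _all values: every field of a document joined into one string
all_field = {
    "doc1": "2017-01-01 my first article",
    "doc2": "2017-01-02 my second article",
    "doc3": "2017-01-03 my third article",
}

# Build the inverted index over the concatenated strings
index = {}
for doc_id, text in all_field.items():
    for term in tokenize(text):
        index.setdefault(term, set()).add(doc_id)

def search_all(query):
    """OR together the posting lists of every term in the analyzed query."""
    hits = set()
    for term in tokenize(query):
        hits |= index.get(term, set())
    return sorted(hits)

print(tokenize("2017-01-01"))    # the date is split at the hyphens
print(search_all("2017"))
print(search_all("2017-01-01"))
```

Because the date is broken into separate terms and the terms are combined with OR, both searches return every document in this toy index.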

  GET /_search?q=post_date:2017-01-01 

  A date field is indexed as an exact value:

  value        doc1   doc2   doc3
  2017-01-01   *
  2017-01-02          *
  2017-01-03                 *

  For post_date:2017-01-01, the whole value 2017-01-01 is matched, so exactly one document, doc1, is returned.
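
For contrast with the full text case, an exact value index can be sketched in Python as a plain lookup keyed by the whole value. This is a simplified model: it keeps the dates as strings purely for illustration, which is an assumption and not a description of how es stores dates internally.

```python
# post_date values taken from the table above
post_dates = {"doc1": "2017-01-01", "doc2": "2017-01-02", "doc3": "2017-01-03"}

# Exact-value index: the whole value is the key, with no tokenization
exact_index = {}
for doc_id, value in post_dates.items():
    exact_index.setdefault(value, set()).add(doc_id)

def search_exact(value):
    """Return the ids of documents whose field equals the whole value."""
    return sorted(exact_index.get(value, set()))

print(search_exact("2017-01-01"))
print(search_exact("2017-01-02"))
```

Each lookup returns a single document, because the value is never split into 2017, 01, and 01.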

  GET /_search?q=post_date:2017 is not explained here, because it relies on an optimization introduced after es 5.2.

Testing an analyzer

GET /_analyze
{
  "analyzer": "standard",
  "text": "Text to analyze"
}

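  (1) When data is inserted directly into es, es automatically creates the index, together with the type and its mapping.

  (2) The mapping automatically defines the data type of each field.

  (3) Depending on the data type (text versus date, for example), a field is either exact value or full text.

  (4) An exact value is added to the inverted index whole, as a single term. Full text goes through several processing steps first: tokenization and normalization (tense conversion, synonym conversion, case conversion), and only then is it added to the inverted index.

  (5) Whether a field is exact value or full text also determines how a search against it behaves, and that behavior stays consistent with how the inverted index was built. A search on an exact value field matches against the whole value, while a full text query string is tokenized and normalized before it is looked up in the inverted index.

  (6) You can rely on es dynamic mapping to create the mapping automatically, including the data types. You can also create the mapping for an index and type by hand in advance and configure each field yourself: data type, indexing behavior, analyzer, and so on.

  The mapping is the metadata of a type within an index. Every type has its own mapping, which determines the data types, how the inverted index is built, and how searches behave.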
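
The idea behind dynamic mapping can be sketched in Python as simple type inference over a document. This is a heavily simplified model: the `infer_type` rules and the date pattern are assumptions for illustration and cover only a few cases, and the `title` and `author_id` field names are hypothetical.

```python
import re

DATE_RE = re.compile(r"^\d{4}-\d{2}-\d{2}$")  # simplified date detection (assumption)

def infer_type(value):
    """Guess a field type from a JSON-like value, loosely mimicking dynamic mapping."""
    if isinstance(value, bool):   # check bool first: bool is a subclass of int in Python
        return "boolean"
    if isinstance(value, int):
        return "long"
    if isinstance(value, float):
        return "float"
    if isinstance(value, str):
        return "date" if DATE_RE.match(value) else "text"
    if isinstance(value, dict):
        return "object"
    raise ValueError(f"unsupported value in this sketch: {value!r}")

def infer_mapping(doc):
    """Build a mapping-like dict with one inferred type per field."""
    return {field: {"type": infer_type(value)} for field, value in doc.items()}

# post_date is from the text; title and author_id are hypothetical field names
doc = {"post_date": "2017-01-02", "title": "my second article", "author_id": 11400}
print(infer_mapping(doc))
```

In this model `post_date` is inferred as a date (exact value) and `title` as text (full text), which is the distinction that drives the different search behavior described in points (3) to (5).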

 
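Forward index

  Searching relies on the inverted index. Sorting relies on the forward index, which exposes every field of every document so that the documents can be ordered. The forward index is simply doc values.

  When an index is built, es creates the inverted index for searching and also the forward index, that is doc values, for sorting, aggregation, filtering, and similar operations.

  Doc values are stored on disk. If there is enough memory, the OS caches them in memory automatically and performance stays high. If there is not enough memory, they are read from disk instead.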

  Take two documents:

  doc1: hello world you and me

  doc2: hi, world, how are you

  Their inverted index:

  word    doc1   doc2
  hello   *
  world   *      *
  you     *      *
  and     *
  me      *
  hi             *
  how            *
  are            *

  A search for hello you is analyzed into hello, you:

  hello --> doc1

  you --> doc1,doc2

  Both documents match:

  doc1: hello world you and me

  doc2: hi, world, how are you

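  Sorting by age cannot use the inverted index. It needs the field values of each document:

  doc1: { "name": "jack", "age": 27 }

  doc2: { "name": "tom", "age": 30 }

  The forward index (doc values) stores them per document:

  document   name   age
  doc1       jack   27
  doc2       tom    30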
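
The two structures can be put side by side in a short Python sketch. It is a conceptual model only: real doc values are an on-disk columnar format, whereas the dictionaries and the `search` and `sort_by` helpers here are assumptions chosen for readability.

```python
import re

# Example data taken from the text above
texts = {"doc1": "hello world you and me", "doc2": "hi, world, how are you"}
people = {"doc1": {"name": "jack", "age": 27}, "doc2": {"name": "tom", "age": 30}}

# Inverted index: term -> set of doc ids (used for searching)
inverted = {}
for doc_id, text in texts.items():
    for term in re.findall(r"[a-z]+", text.lower()):
        inverted.setdefault(term, set()).add(doc_id)

# Forward index: field -> {doc id -> value} (used for sorting and aggregation)
doc_values = {}
for doc_id, fields in people.items():
    for field, value in fields.items():
        doc_values.setdefault(field, {})[doc_id] = value

def search(query):
    """Find documents containing any term of the query via the inverted index."""
    hits = set()
    for term in re.findall(r"[a-z]+", query.lower()):
        hits |= inverted.get(term, set())
    return hits

def sort_by(doc_ids, field, reverse=False):
    """Order the matched documents by a field value read from the forward index."""
    return sorted(doc_ids, key=lambda d: doc_values[field][d], reverse=reverse)

hits = search("hello you")
print(sorted(hits))
print(sort_by(hits, "age", reverse=True))
```

The search step only ever touches `inverted`, and the sort step only ever touches `doc_values`, which mirrors the division of labor described above.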

The default analyzer

  standard

  standard tokenizer: splits text on word boundaries

  standard token filter: does nothing

  lowercase token filter: converts all letters to lowercase

  stop token filter (disabled by default): removes stop words such as a, the, it

Changing analyzer settings

  Enable the English stop word token filter

PUT /my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "es_std": {
          "type": "standard",
          "stopwords": "_english_"
        }
      }
    }
  }
}

GET /my_index/_analyze
{
  "analyzer": "standard", 
  "text": "a dog is in the house"
}

GET /my_index/_analyze
{
  "analyzer": "es_std",
  "text":"a dog is in the house"
}
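
The difference between the two _analyze calls can be approximated offline with a Python sketch. It is an approximation, not the analyzers themselves: `standard` is a rough regex stand-in for the standard tokenizer plus lowercasing, and `ENGLISH_STOPWORDS` is assumed to mirror the default English stop word list that `_english_` refers to.

```python
import re

# Assumed to mirror the default English stop word list behind "_english_"
ENGLISH_STOPWORDS = {
    "a", "an", "and", "are", "as", "at", "be", "but", "by", "for", "if",
    "in", "into", "is", "it", "no", "not", "of", "on", "or", "such", "that",
    "the", "their", "then", "there", "these", "they", "this", "to", "was",
    "will", "with",
}

def standard(text):
    """Approximate the standard analyzer: tokenize, then lowercase; no stop words removed."""
    return [t.lower() for t in re.findall(r"[0-9A-Za-z]+", text)]

def es_std(text):
    """Approximate es_std: the standard analyzer with English stop words removed."""
    return [t for t in standard(text) if t not in ENGLISH_STOPWORDS]

text = "a dog is in the house"
print(standard(text))
print(es_std(text))
```

With stop words enabled, only dog and house survive in this model, while the plain standard analyzer keeps all six words.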

Building your own custom analyzer

PUT /my_index
{
  "settings": {
    "analysis": {
      "char_filter": {
        "&_to_and": {
          "type": "mapping",
          "mappings": ["&=> and"]
        }
      },
      "filter": {
        "my_stopwords": {
          "type": "stop",
          "stopwords": ["the", "a"]
        }
      },
      "analyzer": {
        "my_analyzer": {
          "type": "custom",
          "char_filter": ["html_strip", "&_to_and"],
          "tokenizer": "standard",
          "filter": ["lowercase", "my_stopwords"]
        }
      }
    }
  }
}

GET /my_index/_analyze
{
  "text": "tom&jerry are a friend in the house, <a>, HAHA!!",
  "analyzer": "my_analyzer"
}
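
The pipeline that my_analyzer defines (char filters, then tokenizer, then token filters) can be traced step by step in a Python sketch. This is a simplified simulation under assumptions: `html_strip` only drops tag-like text, the mapping is assumed to replace & with and without surrounding spaces, and the tokenizer is a regex approximation.

```python
import re

MY_STOPWORDS = {"the", "a"}  # same list as the my_stopwords filter

def html_strip(text):
    """Simplified html_strip char filter: drop anything that looks like a tag."""
    return re.sub(r"<[^>]+>", "", text)

def amp_to_and(text):
    """Simplified &_to_and char filter (assumes the replacement is 'and' with no spaces)."""
    return text.replace("&", "and")

def my_analyzer(text):
    # 1. char filters run on the raw string, in the configured order
    for char_filter in (html_strip, amp_to_and):
        text = char_filter(text)
    # 2. tokenizer: rough approximation of the standard tokenizer
    tokens = re.findall(r"[0-9A-Za-z]+", text)
    # 3. token filters: lowercase first, then stop word removal
    tokens = [t.lower() for t in tokens]
    return [t for t in tokens if t not in MY_STOPWORDS]

print(my_analyzer("tom&jerry are a friend in the house, <a>, HAHA!!"))
```

The order matters: the stop filter runs after lowercase, so it would also remove an uppercase "The", and the char filters run before tokenization, so tom&jerry reaches the tokenizer as a single word.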

PUT /my_index/_mapping/my_type
{
  "properties": {
    "content": {
      "type": "text",
      "analyzer": "my_analyzer"
    }
  }
}
