[ES] Part 1: Basics | Forward and Inverted Indexes | ES vs. MySQL | Default Analyzers | IK Analyzer | Extending the IK Dictionary and Adding Stop Words

胖胖学编程

Last modified: 2023-08-29 09:54:00

First published: 2023-08-17 16:47:33

Article link: https://blog.csdn.net/qq_35896718/article/details/132344563

Reference: https://www.bilibili.com/video/BV1b8411Z7w5?p=6

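I. Forward and Inverted Indexes

1. ES uses an inverted index

1) Document: each record is a document. In MySQL a document is one row; on the web a document is one page.

2) Term: the words a document is split into according to meaning (Chinese text is split into Chinese words, English text into English words).

3) Indexing: each document is tokenized in turn, and the result is stored in two columns: the term and the IDs of the documents that contain it. When a term has been seen before, the new document ID is simply appended to its entry. Terms are unique and never repeated, and an index is then built on the term column.

4) Querying: to search for "华为手机" (Huawei phone), the user's input is tokenized first, and each term is looked up in the inverted index. Because every term is indexed, the lookup is fast. Looking up "华为" returns document IDs 2 and 3, and looking up "手机" returns document IDs 1 and 2, so document 2 contains both terms while documents 1 and 3 contain only one. The IDs are then used to fetch the documents, which are placed in the result set.

A query therefore performs two lookups: the first finds the document IDs for the user's terms in the term list, and the second fetches the documents by ID. Both lookups go through an index, so the query is efficient.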
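The indexing and query steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the four sample documents, the toy vocabulary, and the greedy longest-match `tokenize` helper are all hypothetical (they are not taken from ES), and ranking by the number of matched terms is a simplification of real relevance scoring.

```python
from collections import defaultdict

# Hypothetical sample documents, chosen so that "华为" maps to IDs 2 and 3
# and "手机" maps to IDs 1 and 2, as in the walkthrough above.
DOCS = {
    1: "小米手机",
    2: "华为手机",
    3: "华为小米充电器",
    4: "小米手环",
}

# Toy vocabulary standing in for a real analyzer's dictionary.
VOCAB = {"小米", "华为", "手机", "充电器", "手环"}


def tokenize(text, vocab=VOCAB):
    """Greedy longest-match segmentation; unknown characters become single tokens."""
    max_len = max(len(word) for word in vocab)
    tokens, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in vocab:
                tokens.append(piece)
                i += size
                break
    return tokens


def build_inverted_index(docs):
    """Map each unique term to the sorted list of document IDs that contain it."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            postings[term].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}


def search(query, index, docs):
    """Lookup 1: term -> IDs. Lookup 2: ID -> document. Sort by matched-term count."""
    hits = defaultdict(int)
    for term in tokenize(query):
        for doc_id in index.get(term, []):
            hits[doc_id] += 1
    ranked = sorted(hits, key=lambda doc_id: (-hits[doc_id], doc_id))
    return [(doc_id, docs[doc_id], hits[doc_id]) for doc_id in ranked]


INDEX = build_inverted_index(DOCS)
RESULTS = search("华为手机", INDEX, DOCS)
print(INDEX["华为"], INDEX["手机"])
print(RESULTS)
```

With this sample data, document 2 comes first because it matches both terms, followed by documents 1 and 3 with one match each.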

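5) Forward index vs. inverted index

A forward index walks through the documents one by one from top to bottom and looks for the term inside each document: it goes from document to term.

An inverted index first finds the IDs recorded for the term and then fetches those documents: it goes from term to document.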
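To make the contrast concrete, here is a minimal sketch of both directions. The pre-tokenized documents and the function names are hypothetical; the point is only that the forward search visits every document, while the inverted search is a single dictionary lookup once the index exists.

```python
# Hypothetical pre-tokenized documents (ID -> terms); the data is illustrative.
DOC_TERMS = {
    1: ["小米", "手机"],
    2: ["华为", "手机"],
    3: ["华为", "小米", "充电器"],
    4: ["小米", "手环"],
}


def forward_search(term, doc_terms):
    """Document -> term: visit every document and check whether it holds the term."""
    visited, matches = 0, []
    for doc_id, terms in doc_terms.items():
        visited += 1
        if term in terms:
            matches.append(doc_id)
    return matches, visited


def invert(doc_terms):
    """Build term -> document IDs once, ahead of query time."""
    index = {}
    for doc_id, terms in doc_terms.items():
        for term in terms:
            ids = index.setdefault(term, [])
            if doc_id not in ids:
                ids.append(doc_id)
    return index


def inverted_search(term, index):
    """Term -> document: one dictionary lookup, no scan over the documents."""
    return index.get(term, [])


INDEX = invert(DOC_TERMS)
SCAN_MATCHES, SCAN_VISITED = forward_search("手机", DOC_TERMS)
LOOKUP_MATCHES = inverted_search("手机", INDEX)
print(SCAN_MATCHES, SCAN_VISITED, LOOKUP_MATCHES)
```

Both approaches return the same IDs; the difference is how much data has to be touched per query.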

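II. ES vs. MySQL

1. Different format: ES stores each record as a JSON document.

2. Index: a collection of documents of the same type, comparable to a table in MySQL.

3. Mapping: the field constraints for the documents in an index, similar to a table's columns and their data types.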

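4. MySQL and ES compared: a table corresponds to an index, a row to a document, a column to a field, the table schema to a mapping, and SQL to the JSON-based query DSL.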
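As a rough illustration of the row/document and schema/mapping analogy, the sketch below writes a mapping and a document as Python dictionaries and serializes the document to JSON. The field names `title` and `price` and the helper `matches_mapping` are assumptions made up for this example; the check it performs is stricter than ES itself, which by default also accepts fields that are not declared in the mapping.

```python
import json

# Hypothetical mapping: plays the role of a table schema (field names and types).
# Using "ik_smart" here assumes the IK plugin is installed.
mapping = {
    "mappings": {
        "properties": {
            "title": {"type": "text", "analyzer": "ik_smart"},
            "price": {"type": "integer"},
        }
    }
}

# Hypothetical document: plays the role of a row, stored as JSON.
document = {"title": "华为手机", "price": 4999}


def matches_mapping(doc, mapping_body):
    """Rough check: every field in the document is declared in the mapping."""
    declared = mapping_body["mappings"]["properties"]
    return all(field in declared for field in doc)


document_json = json.dumps(document, ensure_ascii=False)
print(document_json)
print(matches_mapping(document, mapping))
```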

III. Analyzers

1. Default analyzers (analyzer)

The options include standard, english, and chinese, but all of them split Chinese text into one token per character.

POST /_analyze
{
  "text":"胖胖and笨笨都是可爱的小猫猫",
  "analyzer":"chinese"
}
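The per-character behaviour can be imitated with a short regular expression. This is only a rough approximation for illustration, not the real analyzer logic: it treats each CJK character as its own token and each run of ASCII letters or digits as one token, and it does not model stop-word removal, which some analyzers apply on top.

```python
import re

# One token per CJK character, or one token per run of ASCII letters/digits.
TOKEN_PATTERN = re.compile(r"[A-Za-z0-9]+|[\u4e00-\u9fff]")


def per_character_tokens(text):
    """Split text the way a character-level analyzer would, lowercasing ASCII runs."""
    return [match.group(0).lower() for match in TOKEN_PATTERN.finditer(text)]


TOKENS = per_character_tokens("胖胖and笨笨都是可爱的小猫猫")
print(TOKENS)
```

Every Chinese word is broken into single characters, which is why a dictionary-based analyzer such as IK is preferable for Chinese text.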

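2. IK analyzer

1) Installing the IK analyzer

See: [ES] Installing ES, Kibana, and the IK analyzer on a Mac (胖胖学编程's blog, CSDN)

2) The IK analyzer has two modes:

① ik_smart: coarsest segmentation (fewest tokens)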

POST /_analyze
{
  "text":"胖胖and笨笨都是可爱的小猫猫",
  "analyzer":"ik_smart"
}


Result
{
  "tokens" : [
    {
      "token" : "胖胖",
      "start_offset" : 0,
      "end_offset" : 2,
      "type" : "CN_WORD",
      "position" : 0
    },
    {
      "token" : "笨笨",
      "start_offset" : 5,
      "end_offset" : 7,
      "type" : "CN_WORD",
      "position" : 1
    },
    {
      "token" : "都是",
      "start_offset" : 7,
      "end_offset" : 9,
      "type" : "CN_WORD",
      "position" : 2
    },
    {
      "token" : "可爱",
      "start_offset" : 9,
      "end_offset" : 11,
      "type" : "CN_WORD",
      "position" : 3
    },
    {
      "token" : "的",
      "start_offset" : 11,
      "end_offset" : 12,
      "type" : "CN_CHAR",
      "position" : 4
    },
    {
      "token" : "小猫猫",
      "start_offset" : 12,
      "end_offset" : 15,
      "type" : "CN_WORD",
      "position" : 5
    }
  ]
}

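② ik_max_word: finest-grained segmentation. After splitting the text into words, it checks whether each word can be split further and, if so, keeps splitting. Because the tokens are finer, a search is more likely to match, at the cost of more memory.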

POST /_analyze
{
  "text":"胖胖and笨笨都是可爱的小猫猫",
  "analyzer":"ik_max_word"
}

Result
{
  "tokens" : [
    {
      "token" : "胖胖",
      "start_offset" : 0,
      "end_offset" : 2,
      "type" : "CN_WORD",
      "position" : 0
    },
    {
      "token" : "笨笨",
      "start_offset" : 5,
      "end_offset" : 7,
      "type" : "CN_WORD",
      "position" : 1
    },
    {
      "token" : "都是",
      "start_offset" : 7,
      "end_offset" : 9,
      "type" : "CN_WORD",
      "position" : 2
    },
    {
      "token" : "可爱",
      "start_offset" : 9,
      "end_offset" : 11,
      "type" : "CN_WORD",
      "position" : 3
    },
    {
      "token" : "的",
      "start_offset" : 11,
      "end_offset" : 12,
      "type" : "CN_CHAR",
      "position" : 4
    },
    {
      "token" : "小猫猫",
      "start_offset" : 12,
      "end_offset" : 15,
      "type" : "CN_WORD",
      "position" : 5
    },
    {
      "token" : "小猫",
      "start_offset" : 12,
      "end_offset" : 14,
      "type" : "CN_WORD",
      "position" : 6
    },
    {
      "token" : "猫猫",
      "start_offset" : 13,
      "end_offset" : 15,
      "type" : "CN_WORD",
      "position" : 7
    }
  ]
}
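The difference between the two modes can be sketched with a toy dictionary. This is not IK's actual algorithm: `smart_like` is a plain greedy longest-match segmenter, `max_word_like` simply emits every dictionary word found at each position, and the vocabulary below is a hypothetical stand-in for IK's built-in dictionary.

```python
# Hypothetical vocabulary standing in for the IK dictionary.
VOCAB = {"胖胖", "笨笨", "都是", "可爱", "小猫猫", "小猫", "猫猫"}


def smart_like(text, vocab):
    """Coarse mode: at each position take the longest dictionary word, then move past it."""
    max_len = max(len(word) for word in vocab)
    tokens, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in vocab:
                tokens.append(piece)
                i += size
                break
    return tokens


def max_word_like(text, vocab):
    """Fine mode: emit every dictionary word starting at each position, longest first."""
    max_len = max(len(word) for word in vocab)
    tokens, covered_until = [], 0
    for i in range(len(text)):
        found = False
        for size in range(min(max_len, len(text) - i), 1, -1):
            piece = text[i:i + size]
            if piece in vocab:
                tokens.append(piece)
                covered_until = max(covered_until, i + size)
                found = True
        # A character not covered by any word is kept as a single-character token.
        if not found and i >= covered_until:
            tokens.append(text[i])
            covered_until = i + 1
    return tokens


TEXT = "笨笨都是可爱的小猫猫"
SMART = smart_like(TEXT, VOCAB)
MAX_WORD = max_word_like(TEXT, VOCAB)
print(SMART)
print(MAX_WORD)
```

On this input the fine mode returns everything the coarse mode does plus the overlapping sub-words 小猫 and 猫猫, mirroring the two results shown above.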

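IV. IK Extension and Stop-Word Dictionaries

1. Example

Extension dictionary: 米哈游 (miHoYo) and 原神 (Genshin Impact) are not recognized as words, because the IK dictionary does not contain them.

Stop-word dictionary: particles such as “的” and “了” carry no meaning for search and do not need to be tokenized. Some prohibited terms, such as contraband or the names of national leaders, should also be blocked.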

POST /_analyze
{
  "text":"米哈游的原神太牛皮了",
  "analyzer":"ik_max_word"
}


{
  "tokens" : [
    {
      "token" : "米",
      "start_offset" : 0,
      "end_offset" : 1,
      "type" : "CN_CHAR",
      "position" : 0
    },
    {
      "token" : "哈",
      "start_offset" : 1,
      "end_offset" : 2,
      "type" : "CN_CHAR",
      "position" : 1
    },
    {
      "token" : "游",
      "start_offset" : 2,
      "end_offset" : 3,
      "type" : "CN_CHAR",
      "position" : 2
    },
    {
      "token" : "的",
      "start_offset" : 3,
      "end_offset" : 4,
      "type" : "CN_CHAR",
      "position" : 3
    },
    {
      "token" : "原",
      "start_offset" : 4,
      "end_offset" : 5,
      "type" : "CN_CHAR",
      "position" : 4
    },
    {
      "token" : "神",
      "start_offset" : 5,
      "end_offset" : 6,
      "type" : "CN_CHAR",
      "position" : 5
    },
    {
      "token" : "太",
      "start_offset" : 6,
      "end_offset" : 7,
      "type" : "CN_CHAR",
      "position" : 6
    },
    {
      "token" : "牛皮",
      "start_offset" : 7,
      "end_offset" : 9,
      "type" : "CN_WORD",
      "position" : 7
    },
    {
      "token" : "了",
      "start_offset" : 9,
      "end_offset" : 10,
      "type" : "CN_CHAR",
      "position" : 8
    }
  ]
}

2. Extending the IK dictionary and adding stop words

1) Edit IKAnalyzer.cfg.xml

Open a shell inside the Docker container and run:

cd /usr/share/elasticsearch/plugins/elasticsearch-analysis-ik-7.12.1/config
vi IKAnalyzer.cfg.xml

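Add these two entries: set the extension dictionary entry (key ext_dict) to ext.dic and the extension stop-word entry (key ext_stopwords) to stopword.dic.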

2) Edit ext.dic and stopword.dic

Create ext.dic in the current directory:

vi ext.dic

Add, one word per line:
米哈游
原神

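Edit stopword.dic (this file already exists) and add 了 and 的, one word per line. If the file still shows garbled characters under cat after editing, which is most likely an encoding problem since the dictionary files should be UTF-8, create a new file with the same name, paste in the original entries, and then add your own.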
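One way to sidestep the encoding problem is to write the dictionary files from a script that sets the encoding explicitly. The sketch below is a hypothetical helper, not part of IK or ES: `add_words` appends words to a dictionary file as UTF-8, one per line, and skips words that are already present. The demo runs against a temporary directory rather than the real plugin config folder.

```python
import tempfile
from pathlib import Path


def add_words(dic_path, new_words):
    """Append words to a dictionary file as UTF-8, one per line, skipping duplicates."""
    path = Path(dic_path)
    merged = []
    if path.exists():
        for line in path.read_text(encoding="utf-8").splitlines():
            word = line.strip()
            if word and word not in merged:
                merged.append(word)
    for word in new_words:
        if word not in merged:
            merged.append(word)
    path.write_text("\n".join(merged) + "\n", encoding="utf-8")
    return merged


# Demo in a temporary directory; the real files live in the IK config folder.
with tempfile.TemporaryDirectory() as tmp:
    ext_dic = Path(tmp) / "ext.dic"
    add_words(ext_dic, ["米哈游", "原神"])
    WORDS = add_words(ext_dic, ["原神", "胖胖"])
    RAW = ext_dic.read_bytes()

print(WORDS)
```

The same helper could be pointed at stopword.dic to add 了 and 的 while keeping the entries that are already there.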

3) Restart ES

4) Test

POST /_analyze
{
  "text":"米哈游的原神太牛皮了",
  "analyzer":"ik_smart"
}

Result:
{
  "tokens" : [
    {
      "token" : "米哈游",
      "start_offset" : 0,
      "end_offset" : 3,
      "type" : "CN_WORD",
      "position" : 0
    },
    {
      "token" : "原神",
      "start_offset" : 4,
      "end_offset" : 6,
      "type" : "CN_WORD",
      "position" : 1
    },
    {
      "token" : "太",
      "start_offset" : 6,
      "end_offset" : 7,
      "type" : "CN_CHAR",
      "position" : 2
    },
    {
      "token" : "牛皮",
      "start_offset" : 7,
      "end_offset" : 9,
      "type" : "CN_WORD",
      "position" : 3
    }
  ]
}
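The effect of the two dictionaries can be reproduced with a toy segmenter. This is a sketch under stated assumptions, not IK's implementation: `segment` is a greedy longest-match tokenizer, `BASE_VOCAB` is a hypothetical stand-in for the built-in dictionary, and the extension and stop words are the ones added in the steps above.

```python
def segment(text, vocab, stop_words=frozenset()):
    """Greedy longest-match segmentation that drops any token listed as a stop word."""
    max_len = max(len(word) for word in vocab)
    tokens, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in vocab:
                if piece not in stop_words:
                    tokens.append(piece)
                i += size
                break
    return tokens


BASE_VOCAB = {"牛皮"}           # hypothetical stand-in for the built-in dictionary
EXT_WORDS = {"米哈游", "原神"}   # words added to ext.dic
STOP_WORDS = {"的", "了"}       # words added to stopword.dic

TEXT = "米哈游的原神太牛皮了"
BEFORE = segment(TEXT, BASE_VOCAB)
AFTER = segment(TEXT, BASE_VOCAB | EXT_WORDS, STOP_WORDS)
print(BEFORE)
print(AFTER)
```

Before the change the unknown names fall apart into single characters; afterwards they are kept whole and the two particles disappear, which matches the before and after results in this section.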