Elasticsearch (10): Inverted Index Fundamentals, Analyzers, Exact Values vs. Full-Text Search

Author: 夏目

Category: Elasticsearch. Tags: elasticsearch, kibana

Article link: https://blog.csdn.net/wuzhiwei549/article/details/80394894


Inverted Index Fundamentals

  doc1: I really liked my small dogs, and I think my mom also liked them.

  doc2: He never liked any dogs, so I hope that my mom will not expect me to liked him.

  Tokenize both documents and build a first, naive inverted index:

  word     doc1  doc2
  I        *     *
  really   *
  liked    *     *
  my       *     *
  small    *
  dogs     *     *
  and      *
  think    *
  mom      *     *
  also     *
  them     *
  He             *
  never          *
  any            *
  so             *
  hope           *
  that           *
  will           *
  not            *
  expect         *
  me             *
  to             *
  him            *

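  This is the simplest possible way to build an inverted index.

  Search

  Searching for "mother like little dog" returns nothing, because none of the four query terms appears in the index:

  mother

  like

  little

  dog

  Is this the result we want? Certainly not. To a reader, mother and mom are synonyms; like and liked differ only in tense (present vs. past); little and small are synonyms; dog and dogs differ only in number (singular vs. plural).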
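The naive index and the failing search can be sketched in Python. This is a minimal illustration, assuming tokens are simply runs of letters and no normalization is applied; `tokenize`, `build_index`, and `search` are illustrative names, not Elasticsearch APIs.

```python
import re
from collections import defaultdict

DOCS = {
    "doc1": "I really liked my small dogs, and I think my mom also liked them.",
    "doc2": "He never liked any dogs, so I hope that my mom will not expect me to liked him.",
}

def tokenize(text):
    """Split on anything that is not a letter; keep case and word form as-is."""
    return re.findall(r"[A-Za-z]+", text)

def build_index(docs):
    """Map each raw token to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in tokenize(text):
            index[token].add(doc_id)
    return dict(index)

def search(index, query):
    """Return the ids of documents that contain at least one query token."""
    hits = set()
    for token in tokenize(query):
        hits |= index.get(token, set())
    return hits

index = build_index(DOCS)
print(sorted(index["dogs"]))                    # ['doc1', 'doc2']
print(search(index, "mother like little dog"))  # set()
```

The query finds nothing because the lookup is an exact string comparison: "dog" and "dogs" are different keys.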

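  Normalization: when the inverted index is built, each token produced by tokenization is also transformed, which raises the chance that a later search finds related documents.

  Typical transformations: tense, singular/plural, synonyms, and letter case.

  mother --> mom

  liked --> like

  small --> little

  dogs --> dog

  Rebuild the inverted index with normalization applied, search again for "mother like little dog", and the documents are found.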

  word     doc1  doc2  normalized from
  I        *     *
  really   *
  like     *     *     liked --> like
  my       *     *
  little   *           small --> little
  dog      *     *     dogs --> dog
  and      *
  think    *
  mom      *     *
  also     *
  them     *
  He             *
  never          *
  any            *
  so             *
  hope           *
  that           *
  will           *
  not            *
  expect         *
  me             *
  to             *
  him            *

  The query "mother like little dog" goes through the same tokenization and normalization:

  mother --> mom

  like --> like

  little --> little

  dog --> dog

  Both doc1 and doc2 are returned.

  doc1: I really liked my small dogs, and I think my mom also liked them.

  doc2: He never liked any dogs, so I hope that my mom will not expect me to liked him.
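A minimal Python sketch of the same idea, with normalization applied both when indexing and when searching. The `NORMALIZE` lookup table is a hand-written stand-in for real stemming and synonym handling (an assumption for illustration only), and the function names are not Elasticsearch APIs.

```python
import re
from collections import defaultdict

DOCS = {
    "doc1": "I really liked my small dogs, and I think my mom also liked them.",
    "doc2": "He never liked any dogs, so I hope that my mom will not expect me to liked him.",
}

# Toy lookup table standing in for stemming and synonym filters (an assumption).
NORMALIZE = {"liked": "like", "dogs": "dog", "small": "little", "mother": "mom"}

def analyze(text):
    """Tokenize, then map each token to its normalized form."""
    return [NORMALIZE.get(tok, tok) for tok in re.findall(r"[A-Za-z]+", text)]

def build_index(docs):
    """Map each normalized term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in analyze(text):
            index[term].add(doc_id)
    return dict(index)

def search(index, query):
    """Analyze the query the same way as the documents, then look terms up."""
    hits = set()
    for term in analyze(query):
        hits |= index.get(term, set())
    return hits

index = build_index(DOCS)
print(analyze("mother like little dog"))                # ['mom', 'like', 'little', 'dog']
print(sorted(search(index, "mother like little dog")))  # ['doc1', 'doc2']
```

The key design point is that documents and queries pass through the same `analyze` function, so both sides end up with identical terms.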

  An inverted index is a structure well suited to search.

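  Structure of the inverted index

  (1) The list of documents that contain the term

  (2) The number of documents that contain the term (document frequency), from which IDF (inverse document frequency) is derived

  (3) How many times the term occurs in each document: TF (term frequency)

  (4) The position of the term within each document

  (5) The length of each document: length norm

  (6) The average document length across the index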

  word doc1 doc2 

  dog         * * 

  hello * 

  you         * 
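The kinds of information listed above can be sketched with plain Python dictionaries. This is only an illustration of what is recorded, not Lucene's actual on-disk layout; it assumes lowercase letter-only tokens, and `postings`, `doc_len`, and `term_stats` are hypothetical names.

```python
import re
from collections import defaultdict

DOCS = {
    "doc1": "I really liked my small dogs, and I think my mom also liked them.",
    "doc2": "He never liked any dogs, so I hope that my mom will not expect me to liked him.",
}

postings = defaultdict(dict)  # term -> {doc_id: [token positions]}
doc_len = {}                  # doc_id -> number of tokens, item (5)

for doc_id, text in DOCS.items():
    tokens = re.findall(r"[a-z]+", text.lower())
    doc_len[doc_id] = len(tokens)
    for pos, tok in enumerate(tokens):
        postings[tok].setdefault(doc_id, []).append(pos)

def term_stats(term):
    """Collect the per-term information for one term."""
    by_doc = postings.get(term, {})
    return {
        "docs": sorted(by_doc),                                 # item (1)
        "doc_freq": len(by_doc),                                # item (2)
        "tf": {d: len(p) for d, p in by_doc.items()},           # item (3)
        "positions": {d: list(p) for d, p in by_doc.items()},   # item (4)
    }

avg_len = sum(doc_len.values()) / len(doc_len)                  # item (6)

print(term_stats("small"))
# {'docs': ['doc1'], 'doc_freq': 1, 'tf': {'doc1': 1}, 'positions': {'doc1': [4]}}
print(doc_len, avg_len)
# {'doc1': 14, 'doc2': 18} 16.0
```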

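  Benefits of an immutable inverted index

  (1) No locking is needed, which improves concurrency and avoids lock-related problems

  (2) Because the data never changes, it can stay in the OS cache as long as there is enough memory

  (3) Filter caches can stay in memory for the same reason

  (4) The index can be compressed, which reduces disk I/O and the memory needed to cache it

  Drawback of an immutable inverted index: any change requires rebuilding the whole index.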

Analyzers

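  1. What is an analyzer

  An analyzer splits text into terms and normalizes them, which improves recall.

  Given a piece of text, the analyzer breaks it into individual words and normalizes each one (tense, singular/plural, and so on).

  Recall: the share of relevant documents that a search actually returns. Normalization raises recall because more relevant documents can match.

  character filter: preprocesses the text before tokenization. The most common cases are stripping HTML tags (<span>hello</span> --> hello) and replacing characters (& --> and, so I&you --> I and you).

  tokenizer: splits the text into tokens, e.g. hello you and me --> hello, you, and, me

  token filter: transforms the tokens, e.g. lowercase (Tom --> tom), stop words (a/the/an are dropped), synonyms (mother --> mom, small --> little), stemming (dogs --> dog, liked --> like)

  The analyzer matters because only its final output is used to build the inverted index.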
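The three stages can be sketched as three small Python functions chained together. This is a toy pipeline under stated assumptions: the stop-word set and the `REWRITES` table are hand-written stand-ins for real stop-word, synonym, and stemming filters, and none of the names are Elasticsearch APIs.

```python
import re

STOP_WORDS = {"a", "an", "the"}
# Toy table standing in for stemming and synonym filters (an assumption).
REWRITES = {"dogs": "dog", "liked": "like", "mother": "mom", "small": "little"}

def char_filter(text):
    """Preprocess raw text: strip HTML tags and spell out '&'."""
    text = re.sub(r"<[^>]+>", "", text)
    return text.replace("&", " and ")

def tokenizer(text):
    """Split the filtered text into tokens on non-alphanumeric characters."""
    return re.findall(r"[A-Za-z0-9]+", text)

def token_filter(tokens):
    """Lowercase, drop stop words, then apply the rewrite table."""
    out = []
    for tok in tokens:
        tok = tok.lower()
        if tok in STOP_WORDS:
            continue
        out.append(REWRITES.get(tok, tok))
    return out

def analyze(text):
    """Run the stages in order: character filter, tokenizer, token filter."""
    return token_filter(tokenizer(char_filter(text)))

print(analyze("<span>Tom&I liked the small dogs</span>"))
# ['tom', 'and', 'i', 'like', 'little', 'dog']
```

The order matters: the character filter sees the raw string, the tokenizer sees cleaned text, and the token filters only ever see individual tokens.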

  2. Built-in analyzers

  Sample text: Set the shape to semi-transparent by calling set_trans(5)

  standard analyzer (the default): set, the, shape, to, semi, transparent, by, calling, set_trans, 5

  simple analyzer: set, the, shape, to, semi, transparent, by, calling, set, trans

  whitespace analyzer: Set, the, shape, to, semi-transparent, by, calling, set_trans(5)

  language analyzer (language-specific, e.g. english): set, shape, semi, transpar, call, set_tran, 5
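Two of these behaviors are simple enough to approximate in a few lines of Python. The functions below are rough imitations for ASCII text only, written to reproduce the token lists above; they are not the Elasticsearch implementations, and the standard and english analyzers are deliberately left out because their rules are more involved.

```python
import re

TEXT = "Set the shape to semi-transparent by calling set_trans(5)"

def whitespace_analyzer(text):
    """Imitation: split on whitespace only, keep case and punctuation."""
    return text.split()

def simple_analyzer(text):
    """Imitation: lowercase, then keep only runs of ASCII letters."""
    return re.findall(r"[a-z]+", text.lower())

print(whitespace_analyzer(TEXT))
# ['Set', 'the', 'shape', 'to', 'semi-transparent', 'by', 'calling', 'set_trans(5)']
print(simple_analyzer(TEXT))
# ['set', 'the', 'shape', 'to', 'semi', 'transparent', 'by', 'calling', 'set', 'trans']
```

On a running cluster, the `_analyze` API shows what a real analyzer produces for a given text, which is the reliable way to check behavior on your own data.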

Exact Values vs. Full-Text Search

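  1. Exact value

  2017-01-01 as an exact value: a search matches only if the query is exactly 2017-01-01.

  A query for 01 alone finds nothing.

  2. Full text

  (1) Abbreviation vs. full name: cn vs. china

  (2) Word forms: like, liked, likes

  (3) Case: Tom vs. tom

  (4) Synonyms: like vs. love

  2017-01-01 is analyzed into 2017, 01, 01, so a search for 2017 or for 01 finds it.

  china: a search for cn also finds china.

  likes: a search for like also finds likes.

  Tom: a search for tom also finds Tom.

  like: a search for the synonym love also finds like.

  Full-text search does not just match a complete value. The value is split into terms (tokenized), and it can also match through abbreviations, tense, case, synonyms, and so on.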
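The difference between the two modes can be sketched as two matching functions. This is a simplified model, assuming a hand-written `NORMALIZE` table in place of real abbreviation, stemming, and synonym handling; `exact_match` and `full_text_match` are illustrative names rather than Elasticsearch query types.

```python
import re

# Toy normalization table (an assumption): abbreviations, word forms, synonyms.
NORMALIZE = {"cn": "china", "likes": "like", "liked": "like", "love": "like"}

def analyze(text):
    """Lowercase, split into alphanumeric tokens, then normalize each token."""
    return [NORMALIZE.get(t, t) for t in re.findall(r"[a-z0-9]+", text.lower())]

def exact_match(value, query):
    """Exact value: the whole stored value must equal the query."""
    return value == query

def full_text_match(value, query):
    """Full text: match if any analyzed query term is among the value's terms."""
    return bool(set(analyze(value)) & set(analyze(query)))

print(exact_match("2017-01-01", "01"))          # False
print(exact_match("2017-01-01", "2017-01-01"))  # True
print(full_text_match("2017-01-01", "01"))      # True
print(full_text_match("china", "cn"))           # True
print(full_text_match("Tom", "tom"))            # True
print(full_text_match("like", "love"))          # True
```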
