ElasticSearch Analyzers, Custom Analyzers, and the Inverted Index

The Inverted Index

Index (Forward Index)

The concept of an index, also called a forward index, should be familiar. A classic example is a book's table of contents: an index is like that table of contents, letting you use the page numbers to jump straight to the content you are looking for.
[figure: a book's table of contents as an example of a forward index]
As the figure above shows, the table of contents (the index) makes it easy to find where a chapter starts. This is a forward index, and it is what relational databases use: when MySQL builds its primary-key index as a B+ tree, each primary key ID points to the row's storage address.

Inverted Index

The inverted index was created to solve the problem of searching massive data sets. If a search engine holds billions of documents and you want to fish out the ones you need, searching through a forward index is an enormous amount of work: you have to flip through the "table of contents" again and again to find page numbers before you can locate any content. If instead there were an index that went directly from a word to the page numbers, and from those to the content addresses, it would save a great deal of effort.
The inverted index is exactly that kind of structure. (It is essentially the forward index reversed: a forward index goes from page number to content, while an inverted index goes from a word in the content to the page numbers, and from there to the content addresses.)
[figure: forward index vs. inverted index]

An Example: Forward Index vs. Inverted Index

Here is an example with three documents: Mastering ElasticSearch, ElasticSearch Server, and ElasticSearch Essentials. The forward index is the structure on the left: each document ID maps directly to the document's content. An inverted index, by contrast, first tokenizes the three documents into four terms: ElasticSearch, Mastering, Server, and Essentials. It then counts how often each term occurs and records, for each term, which documents it appears in and at which position (DocumentId:Position). For example, ElasticSearch appears at the second position of the first document, which is recorded as 1:1, i.e. document ID : position (counting from 0).
[figure: term counts and DocumentId:Position entries for the three documents]
When terms are collected, every occurrence of every term in every document is recorded, so at search time you can go straight from ElasticSearch to the matching document IDs, and even to the exact positions for highlighting.
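The idea can be sketched in a few lines of Python. This is only a minimal illustration of the concept, not how Elasticsearch/Lucene actually stores its index: each term maps to the list of (document ID, position) pairs where it occurs.

```python
from collections import defaultdict

# The three example documents from above.
docs = {
    1: "Mastering ElasticSearch",
    2: "ElasticSearch Server",
    3: "ElasticSearch Essentials",
}

# term -> [(doc_id, position), ...]
inverted_index = defaultdict(list)
for doc_id, text in docs.items():
    for position, term in enumerate(text.split()):
        inverted_index[term].append((doc_id, position))

# A search for "ElasticSearch" goes straight to the documents and positions,
# without scanning every document.
print(inverted_index["ElasticSearch"])  # [(1, 1), (2, 0), (3, 0)]
```

The pair (1, 1) is exactly the 1:1 entry described above: document 1, position 1 (counting from 0).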

What an ElasticSearch Inverted Index Consists Of

  • Term dictionary: records every term that appears in the documents, and maps each term to its posting list.
    The term dictionary is typically implemented as a B+ tree.
  • Posting list: records the set of documents each term appears in. Each element of the set is called a posting, and a posting contains:
    - Document ID (Doc ID)
    - Term frequency (TF): how many times the term occurs in the document; ElasticSearch's relevance score is computed from this
    - Position: where the term occurs in the document, used for phrase search
    - Offset: the start and end character positions of the term, used for highlighting
    [figure: posting list for the term "ElasticSearch" with Doc ID, TF, Position, and Offset columns]
    The figure shows a simple ElasticSearch document inverted index built for the term ElasticSearch: its posting list carries the document ID, TF, position, offset, and so on.
    For example, the word ElasticSearch occurs once in the first document, as its second word, spanning characters 10 through 23.
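The posting structure can be illustrated with a small Python sketch (an illustration only, not Lucene's actual on-disk format): for each term we record, per document, the doc ID, the term frequency, every position, and every character offset.

```python
import re
from collections import defaultdict

docs = {
    1: "Mastering ElasticSearch",
    2: "ElasticSearch Server",
    3: "ElasticSearch Essentials",
}

# term -> list of postings; each posting covers one document.
posting_lists = defaultdict(list)
for doc_id, text in docs.items():
    per_doc = {}  # term -> posting being built for this document
    for position, match in enumerate(re.finditer(r"\S+", text)):
        term = match.group()
        posting = per_doc.setdefault(
            term, {"doc_id": doc_id, "tf": 0, "positions": [], "offsets": []}
        )
        posting["tf"] += 1                                  # term frequency
        posting["positions"].append(position)               # word position
        posting["offsets"].append((match.start(), match.end()))  # char offsets
    for term, posting in per_doc.items():
        posting_lists[term].append(posting)

# Posting for doc 1: ElasticSearch occurs once, at word position 1,
# spanning characters 10..23 of "Mastering ElasticSearch".
print(posting_lists["ElasticSearch"][0])
# {'doc_id': 1, 'tf': 1, 'positions': [1], 'offsets': [(10, 23)]}
```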

Analyzers

Analysis vs. Analyzer

  • In ElasticSearch, analysis is best thought of as a process: converting full text into a sequence of terms, also known as tokenization.
  • An analyzer is the tool that performs tokenization: it breaks concrete text down into individual terms. For example, ElasticSearch English is split by an analyzer into the two terms ElasticSearch and English.
  • ElasticSearch also runs the query text through an analyzer at query time.

What an Analyzer Consists Of

  • Character Filters: preprocess the raw text, for example stripping HTML tags so they do not take part in tokenization. For example, <text>ElasticSearch Server</text> becomes ElasticSearch Server after the character filters.
  • Tokenizer: splits the text into terms according to some rule, for example on whitespace; ElasticSearch Server is split into the two terms ElasticSearch and Server.
  • Token Filter: further processes the terms the tokenizer produced (lowercasing, removing stop words, synonym handling, stemming plurals). For example, lowercasing turns ElasticSearch and Server into elasticsearch and server.
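The three stages above can be mimicked with plain Python functions. This is only a toy pipeline under assumed rules (regex tag stripping, whitespace splitting, a tiny stop-word list), not ElasticSearch's implementation, but the data flow is the same: character filters, then the tokenizer, then token filters.

```python
import re

def char_filter(text: str) -> str:
    # Character filter: strip HTML tags before tokenization.
    return re.sub(r"<[^>]*>", "", text)

def tokenizer(text: str) -> list[str]:
    # Tokenizer: split on whitespace.
    return text.split()

STOPWORDS = {"the", "a", "an", "and", "of"}

def token_filters(tokens: list[str]) -> list[str]:
    # Token filters: lowercase, then drop stop words.
    return [t.lower() for t in tokens if t.lower() not in STOPWORDS]

def analyze(text: str) -> list[str]:
    # The full pipeline: char filter -> tokenizer -> token filters.
    return token_filters(tokenizer(char_filter(text)))

print(analyze("<text>ElasticSearch Server</text>"))
# ['elasticsearch', 'server']
```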

Testing Tokenization with the _analyze API

The _analyze API is an endpoint ElasticSearch exposes for testing how a specified analyzer tokenizes text.

  • The _analyze API can test a named analyzer directly, without specifying an index
  • The _analyze API can test tokenization against a specific field of an index
  • The _analyze API can also test a custom combination of tokenizer and token filters
Testing a Named Analyzer Directly

Request format: GET + _analyze + request body (the analyzer to use + the text to tokenize)

```
# standard analyzer
GET _analyze
{
  "analyzer": "standard",
  "text": "Harness the power of ElasticSearch to build and manage scalable search and analytics solutions with this fast-paced guide About This Book New to ElasticSearch? Here's what you need―a highly practical guide that gives you a quick start with ElasticSearch using easy-to-follow examples; get up and running with ElasticSearch APIs in no time Get the latest guide on ElasticSearch 2.0.0, which contains concise and adequate information on handling all the issues a developer needs to know while handling data in bulk with search relevancy Learn to create large-scale ElasticSearch clusters using best practices Learn from our experts―written by Bharvi Dixit who has extensive experience in working with search servers (especially ElasticSearch) Who This Book Is For Anyone who wants to build efficient search and analytics applications can choose this book. This book is also beneficial for skilled developers, especially ones experienced with Lucene or Solr, who now want to learn Elasticsearch quickly. What You Will Learn Get to know about advanced Elasticsearch concepts and its REST APIs Write CRUD operations and other search functionalities using the ElasticSearch Python and Java clients Dig into wide range of queries and find out how to use them correctly Design schema and mappings with built-in and custom analyzers Excel in data modeling concepts and query optimization Master document relationships and geospatial data Build analytics using aggregations Setup and scale Elasticsearch clusters using best practices Learn to take data backups and secure Elasticsearch clusters In Detail With constantly evolving and growing datasets, organizations have the need to find actionable insights for their business. ElasticSearch, which is the world's most advanced search and analytics engine, brings the ability to make massive amounts of data usable in a matter of milliseconds. It not only gives you the power to build blazing fast search solutions over a massive amount of data, but can also serve as a NoSQL data store. This guide will take you on a tour to become a competent developer quickly with a solid knowledge level and understanding of the ElasticSearch core concepts. Starting from the beginning, this book will cover these core concepts, setting up ElasticSearch and various plugins, working with analyzers, and creating mappings. This book provides complete coverage of working with ElasticSearch using Python and performing CRUD operations and aggregation-based analytics, handling document relationships in the NoSQL world, working with geospatial data, and taking data backups. Finally, we'll show you how to set up and scale ElasticSearch clusters in production environments as well as providing some best practices. Style and approach This is an easy-to-follow guide with practical examples and clear explanations of the concepts. This fast-paced book believes in providing very rich content focusing majorly on practical implementation. This book will provide you with step-by-step practical examples, letting you know about the common errors and solutions along with ample screenshots and code to ensure your success. Table of Contents Chapter 1. Getting Started with Elasticsearch Chapter 2. Understanding Document Analysis and Creating Mappings Chapter 3. Putting Elasticsearch into Action Chapter 4. Aggregations for Analytics Chapter 5. Data Looks Better on Maps: Master Geo-Spatiality Chapter 6. Document Relationships in NoSQL World Chapter 7. Different Methods of Search and Bulk Operations Chapter 8. Controlling Relevancy Chapter 9. Cluster Scaling in Production Deployments Chapter 10. Backups and Security"
}
```
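The other two usage patterns of _analyze listed above take the following shapes. (my_index and title are placeholder names for illustration; the request forms themselves are the standard field-based and custom-combination variants of the _analyze API.)

```
# Test how a specific field of an index tokenizes the text,
# using the analyzer configured in that field's mapping
GET my_index/_analyze
{
  "field": "title",
  "text": "Mastering ElasticSearch"
}

# Test an ad-hoc combination of tokenizer and token filters
GET _analyze
{
  "tokenizer": "standard",
  "filter": ["lowercase"],
  "text": "Mastering ElasticSearch"
}
```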