Analyzers, Mapping, Search, and Java API Access in Elasticsearch

赵大土

Article link: https://blog.csdn.net/weixin_43647393/article/details/103641089

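This article introduces the analyzers in Elasticsearch, including the standard, simple, whitespace, and language analyzers, with a focus on installing and configuring the IK Chinese analyzer. It then covers ES mappings: core data types, dynamic mapping, and custom mapping. Next it walks through search operations: searching everything, multi-index search, conditional search, and the DSL. Finally, it explains how to work with documents and test searches through the Java API.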

Common ES Analyzers

standard analyzer

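The default ES analyzer.
It does not remove stop words (the, a, an, and so on) by default; it lowercases tokens and strips punctuation such as hyphens.

GET _analyze
{
	"text":"I'm a chinese, i came from China.",
	"analyzer":"standard"
}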

simple analyzer

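Splits text into words at every non-letter character and lowercases them; symbols and digits are dropped entirely, which often breaks English grammar (for example, I'm becomes i and m).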

whitespace analyzer

Treats only whitespace (spaces, newlines, and so on) as separators; case and punctuation are left untouched.
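
To make the difference between the standard, simple, and whitespace analyzers concrete, here is a rough Python sketch that imitates their tokenization with regular expressions. It is only an approximation for plain ASCII English text: the real analyzers run inside ES and follow Unicode segmentation rules, and the function names are made up for this illustration.

```python
import re

TEXT = "I'm a chinese, i came from China."


def whitespace_tokens(text):
    # Split on runs of whitespace only; case and punctuation are kept.
    return text.split()


def simple_tokens(text):
    # Break at every non-letter character, then lowercase.
    return [t.lower() for t in re.findall(r"[A-Za-z]+", text)]


def standard_like_tokens(text):
    # Rough stand-in for the standard analyzer: keep letters and digits,
    # allow an apostrophe inside a word, then lowercase.
    pattern = r"[A-Za-z0-9]+(?:'[A-Za-z0-9]+)*"
    return [t.lower() for t in re.findall(pattern, text)]


if __name__ == "__main__":
    print(whitespace_tokens(TEXT))     # keeps "I'm", "chinese," and "China."
    print(simple_tokens(TEXT))         # splits "I'm" into "i" and "m"
    print(standard_like_tokens(TEXT))  # keeps "i'm" as one lowercase token
```

The simple version is the one that breaks I'm apart, which is the grammar damage mentioned above.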

language analyzer

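Language analyzers are a family of language-specific analyzers. The commonly used ones that ES provides target English; Chinese text is split into one token per character.

GET _analyze
{
	"text":"I'm a chinese, i came from China.",
	"analyzer":"english"
}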

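IK Chinese Analyzer

1. Installation

Search GitHub for elasticsearch-analysis-ik, download the source, then compile and package it locally. The plugin version must match the ES version.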

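# Every node in the cluster needs the plugin, and it must live under the ES plugins directory
# Create an ik subdirectory under the ES plugins directory
cd /usr/local/es/plugins/
mkdir ik

# Upload the IK zip archive into the ik directory
# Unpack it there
cd ik
unzip xx.zip

# Give the ES user ownership of the plugin directory
chown -R khue:khue /usr/local/es/plugins/ik
# Restart ES for the plugin to take effect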

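# Test: send one request per analyzer (a single body cannot contain "analyzer" twice)
# Coarse-grained segmentation
GET _analyze
{
	"text":"中华人民共和国国歌",
	"analyzer":"ik_smart"
}

# Fine-grained (exhaustive) segmentation
GET _analyze
{
	"text":"中华人民共和国国歌",
	"analyzer":"ik_max_word"
}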
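
The gap between coarse and fine segmentation is easier to see with a toy example. The Python sketch below is not IK's actual algorithm: it uses a tiny hypothetical dictionary, a greedy longest-match pass as a stand-in for the coarse mode, and an "emit every dictionary word" pass as a stand-in for the exhaustive mode. All names in it are invented for illustration.

```python
# Hypothetical mini dictionary; the real IK dictionaries are far larger.
DICTIONARY = {
    "中华人民共和国", "中华人民", "中华", "华人",
    "人民共和国", "人民", "共和国", "共和", "国歌",
}


def coarse_segment(text, dictionary):
    """Greedy longest match: take the longest dictionary word at each position."""
    max_len = max(map(len, dictionary))
    tokens, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            # Fall back to a single character when nothing longer matches.
            if size == 1 or candidate in dictionary:
                tokens.append(candidate)
                i += size
                break
    return tokens


def exhaustive_segment(text, dictionary):
    """Emit every dictionary word that starts at any position in the text."""
    max_len = max(map(len, dictionary))
    tokens = []
    for i in range(len(text)):
        for size in range(min(max_len, len(text) - i), 1, -1):
            candidate = text[i:i + size]
            if candidate in dictionary:
                tokens.append(candidate)
    return tokens


if __name__ == "__main__":
    text = "中华人民共和国国歌"
    print(coarse_segment(text, DICTIONARY))
    print(exhaustive_segment(text, DICTIONARY))
```

The coarse pass returns two tokens for the sample text, while the exhaustive pass returns every overlapping dictionary word, which is why the fine-grained mode suits indexing and the coarse one suits querying in many setups.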

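2. Configuration files

IKAnalyzer.cfg.xml: configures custom dictionaries

.dic files must be encoded as UTF-8

Special dictionary files:
extra_main.dic
extra_single_word.dic
extra_single_word_full.dic
extra_single_word_low_freq.dic
extra_stopword.dic

main.dic: general dictionary, mostly drawn from the Cihai (辞海) dictionary
preposition.dic: general stop words
stopword.dic: English stop words
quantifier.dic: measure words
suffix.dic: suffixes
surname.dic: surnames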
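
Since the dictionary files have to be UTF-8, it can help to check a custom dictionary before restarting ES. The following is a minimal Python sketch, assuming the file holds one word per line; the function name is hypothetical and nothing here is part of the IK plugin itself.

```python
from pathlib import Path


def load_dictionary(path):
    """Return the words in a .dic file, assuming one word per line.

    Raises UnicodeDecodeError when the file is not valid UTF-8,
    for example when it was saved as GBK.
    """
    text = Path(path).read_bytes().decode("utf-8")
    # Drop blank lines and surrounding whitespace.
    return [line.strip() for line in text.splitlines() if line.strip()]
```

Calling it on a file saved in another encoding fails immediately, which is easier to diagnose than words that silently never match.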

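Mapping in ES

A mapping defines the characteristics of the fields in an index, such as each field's data type and analyzer

# View the mapping of an index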
GET test_index/_mapping

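1. Core data types

Text (analyzed string): text
Integer: byte short integer long
Floating point: float double
Boolean: boolean
Date: date (the default format is strict_date_optional_time||epoch_millis, for example 2018-12-12 or 2018-12-12T10:30:00Z; a custom pattern such as yyyy-MM-dd HH:mm:ss can be set with "format")
Array: {a:[]} (no dedicated type; a field can hold several values of the same type)
Object: {a:{}}
Non-analyzed string (keyword): keyword, must be specified explicitly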

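2. Types assigned to fields by dynamic mapping

123 -> long
12.12 -> float (double before ES 5.0)
2018-12-12 -> date
hello world -> text (with a keyword sub-field)
true -> boolean
[] -> depends on the first non-null value in the array
{} -> object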
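
The dynamic mapping rules can be mimicked in a few lines of Python. This is a simplified sketch and not ES code: it follows the rules documented for ES 5.0 and later (where a JSON floating-point number maps to float), only recognizes plain yyyy-MM-dd strings as dates, and uses invented function names.

```python
import re
from datetime import date

ISO_DATE = re.compile(r"\d{4}-\d{2}-\d{2}")


def looks_like_date(text):
    # Simplification: only plain yyyy-MM-dd strings count as dates here.
    if not ISO_DATE.fullmatch(text):
        return False
    try:
        date.fromisoformat(text)
    except ValueError:
        return False
    return True


def dynamic_type(value):
    """Guess the field type dynamic mapping would pick for a JSON value."""
    if isinstance(value, bool):  # check bool first: True is also an int in Python
        return "boolean"
    if isinstance(value, int):
        return "long"
    if isinstance(value, float):
        return "float"
    if isinstance(value, str):
        # Strings that are not dates become text (ES also adds a keyword sub-field).
        return "date" if looks_like_date(value) else "text"
    if isinstance(value, dict):
        return "object"
    if isinstance(value, list):
        # Arrays have no type of their own: the first non-null element decides.
        for item in value:
            if item is not None:
                return dynamic_type(item)
        return None
    return None  # null values do not create a field
```

An empty array or a null therefore yields no type at all in this sketch, mirroring the fact that ES has nothing to infer from them.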

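3. Custom mapping

Use a request to define the mapping yourself when creating the index and type

The mapping of an existing field cannot be changed
Mappings for fields that do not exist yet can be added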

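# In 7.x, drop the type level ("test_type") and put "properties" directly under "mappings"
# "information" is analyzed with ik_max_word; its "keyword" sub-field is not analyzed and
# is backed by a forward index (doc values), which suits sorting and aggregations
# ignore_above: values longer than 256 characters are not indexed in the sub-field
PUT test_mapping
{
	"settings":{
		"number_of_shards":2,
		"number_of_replicas":1
	},
	"mappings":{
		"test_type":{
			"properties":{
				"name":{
					"type":"text",
					"analyzer":"ik_smart"
				},
				"age":{
					"type":"long"
				},
				"information":{
					"type":"text",
					"analyzer":"ik_max_word",
					"fields":{
						"keyword":{
							"type":"keyword",
							"ignore_above":256
						}
					}
				}
			}
		}
	}
}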
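
The ignore_above setting is easy to misread as truncation. In ES, a string longer than the limit is simply not indexed in the keyword field; it is not cut down to its first 256 characters. The toy Python function below illustrates that behavior; the function name is hypothetical and it models only this one rule.

```python
def keyword_subfield_value(value, ignore_above=256):
    """Return what a keyword field with ignore_above would index.

    Values longer than ignore_above characters are skipped entirely
    (represented as None here), not truncated.
    """
    if len(value) > ignore_above:
        return None
    return value
```

So a 300-character value is still searchable through the analyzed text field, but it will be missing from sorts and aggregations that use the keyword sub-field.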

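# Verify the mapping
# Analyzed: tokens produced by the analyzer of the information field
GET test_mapping/_analyze
{
	"text":"xx",
	"field":"information"
}

# Not analyzed: the keyword sub-field yields the whole text as a single token
GET test_mapping/_analyze
{
	"text":"xx",
	"field":"information.keyword"
}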

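# Add a new field (in 7.x, omit the type: PUT test_mapping/_mapping)
# "index": false means the field is not indexed and cannot be searched; the default is true
PUT test_mapping/_mapping/test_type
{
	"properties":{
		"password":{
			"type":"keyword",
			"index":false
		}
	}
}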

Search

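query string: parameters are passed in the request URL
DSL: parameters are passed in the request body

1. Search everything

# Results are paginated by default: only the first page is returned, with 10 hits
GET _search
# Limit the time spent searching
GET _search?timeout=10ms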

2. Multi-index search

GET test_index,test_search/_search

3. "Devil" search

A search with no conditions

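4. Conditional search

# Chinese terms are not matched correctly when typed raw: non-ASCII characters in the URL get garbled under the default encoding
# Fix it by percent-encoding the value as UTF-8, or through the Tomcat encoding configuration
GET test_index/_search?q=khue

# Search a specific field
GET test_index/_search?q=name:khue

# + means the term must be present, - means it must be absent
GET test_index/_search?q=-name:khue

# Pagination and sorting (asc = ascending, desc = descending)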
GET test_index/_search?from=0&size=2&sort=age:asc
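
The garbled-Chinese problem with query-string searches goes away once the parameter values are percent-encoded as UTF-8. A small Python sketch using only the standard library is shown below; the helper name and the base URL http://localhost:9200 are assumptions for illustration, not something from the original setup.

```python
from urllib.parse import urlencode


def search_url(base, index, params=None):
    """Build a query-string search URL with UTF-8 percent-encoded values."""
    query = urlencode(params or {})
    url = f"{base}/{index}/_search"
    return f"{url}?{query}" if query else url


if __name__ == "__main__":
    base = "http://localhost:9200"  # assumed local node
    # The colon and the Chinese character are both percent-encoded.
    print(search_url(base, "test_index", {"q": "name:奎"}))
    print(search_url(base, "test_index", {"from": 0, "size": 2, "sort": "age:asc"}))
```

The encoded colon (%3A) is decoded by the server before the query is parsed, so name:奎 still targets the name field.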

DSL

Domain Specific Language (DSL)
Offers much richer functionality than the query string

# Match all documents
GET test_index/_search
{
	"query":{
		"match_all":{}
	}
}

# Conditional search
GET test_index/_search
{
	"query":{
		"match":{
			"name":"奎"
		}
	}
}
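
The query-string parameters used earlier (q, from, size, sort) have direct counterparts in a DSL request body. The Python sketch below builds such a body as a plain dictionary and serializes it to JSON; the function and its parameter names are hypothetical, and sending the body to a cluster is left out.

```python
import json


def match_search_body(field, text, start=0, size=10, sort_field=None, order="asc"):
    """Build a DSL body for a match query with paging and optional sorting."""
    body = {
        "query": {"match": {field: text}},
        "from": start,   # offset of the first hit
        "size": size,    # number of hits per page
    }
    if sort_field is not None:
        body["sort"] = [{sort_field: {"order": order}}]
    return body


if __name__ == "__main__":
    body = match_search_body("name", "奎", start=0, size=2, sort_field="age")
    # ensure_ascii=False keeps the Chinese text readable in the output.
    print(json.dumps(body, ensure_ascii=False, indent=2))
```

Because the text travels in a JSON body instead of the URL, the encoding issue seen with query-string searches does not arise as long as the body is sent as UTF-8.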

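# Phrase match: the query text is still analyzed, but a document matches only when all of its terms appear adjacent and in the same order
GET test_index/_search
{
	"query":{
		"match_phrase":{
			"gender"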