Overview
Normalization
Normalizing documents improves recall.
GET _analyze
{
"text":"Mr.Ma is an excellent teacher",
"analyzer":"standard"
}
// The normalized result:
{
"tokens" : [
{
"token" : "mr.ma",
"start_offset" : 0,
"end_offset" : 5,
"type" : "<ALPHANUM>",
"position" : 0
},
{
"token" : "is",
"start_offset" : 6,
"end_offset" : 8,
"type" : "<ALPHANUM>",
"position" : 1
},
{
"token" : "an",
"start_offset" : 9,
"end_offset" : 11,
"type" : "<ALPHANUM>",
"position" : 2
},
{
"token" : "excellent",
"start_offset" : 12,
"end_offset" : 21,
"type" : "<ALPHANUM>",
"position" : 3
},
{
"token" : "teacher",
"start_offset" : 22,
"end_offset" : 29,
"type" : "<ALPHANUM>",
"position" : 4
}
]
}
Normalization processes text according to a set of rules and produces a standardized form, for example converting uppercase letters to lowercase or reducing English tenses to a base form, so that queries can match documents more easily.
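As a quick illustration of both rules, the built-in lowercase and stemmer token filters can be combined directly in _analyze (a minimal sketch, not tied to any index; with the default English stemmer, TEACHES should come back roughly as teach):
# Case and tense normalization with built-in token filters (illustrative)
GET _analyze
{
  "tokenizer": "standard",
  "filter": ["lowercase", "stemmer"],
  "text": "Mr.Ma TEACHES English"
}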
Character filter
Pre-processing that runs before tokenization; it strips characters that are not useful.
- HTML Strip Character Filter: type html_strip. Parameter: escaped_tags, the HTML tags that should be kept (not stripped).
- Mapping Character Filter: type mapping
- Pattern Replace Character Filter: type pattern_replace
html strip character filter
# Define the analyzer in the index settings at creation time
PUT my_index
{
"settings": {
"analysis": {
"char_filter": {
"my_char_filter":{
"type":"html_strip",
"escaped_tags":["a"] # 指定不需要过滤的标签
}
},
"analyzer": {
"my_analyzer":{
"tokenizer":"keyword",
"char_filter":["my_char_filter"]
}
}
}
}
}
# Analyze text with the specified analyzer
GET my_index/_analyze
{
"analyzer": "my_analyzer",
"text":"<p>I'm so <a> happy</a>!</p>"
}
# Result: the <a> tag has not been stripped
{
"tokens" : [
{
"token" : """
I'm so <a> happy</a>!
""",
"start_offset" : 0,
"end_offset" : 33,
"type" : "word",
"position" : 0
}
]
}
mapping character filter
PUT my_index
{
"settings": {
"analysis": {
"char_filter": {
"my_char_filter":{
"type":"mapping",
"mappings":[
"滚 => *",
"垃 => *",
"圾 => *"
]
}
},
"analyzer": {
"my_analyzer":{
"tokenizer":"keyword",
"char_filter":["my_char_filter"]
}
}
}
}
}
GET my_index/_analyze
{
"analyzer": "my_analyzer",
"text":"滚蛋,垃圾人"
}
# Result
{
"tokens" : [
{
"token" : "*蛋,**人",
"start_offset" : 0,
"end_offset" : 6,
"type" : "word",
"position" : 0
}
]
}
pattern replace character filter
A character filter that performs regex-based replacement.
PUT my_index
{
"settings": {
"analysis": {
"char_filter": {
"my_char_filter":{
"type":"pattern_replace",
"pattern":"(\\d{3})\\d+(\\d{4})",
"replacement":"$1***$2"
}
},
"analyzer": {
"my_analyzer":{
"tokenizer":"keyword",
"char_filter":["my_char_filter"]
}
}
}
}
}
GET my_index/_analyze
{
"analyzer": "my_analyzer",
"text":"您的手机号是:172618001200"
}
# Result
{
"tokens" : [
{
"token" : "您的手机号是:172***1200",
"start_offset" : 0,
"end_offset" : 19,
"type" : "word",
"position" : 0
}
]
}
Token filter
Mainly used for stop words, tense conversion, case conversion, synonym conversion, handling of modal particles, and so on.
Below is a synonym filter example; a stop-word/lowercase sketch follows its output.
PUT /test_index
{
"settings": {
"analysis": {
"filter": {
"my_synonym":{
"type":"synonym",
"synonyms":["赵,钱,孙=>吴"]
}
},
"analyzer": {
"my_analyzer":{
"tokenizer":"standard",
"filter":["my_synonym"]
}
}
}
}
}
GET /test_index/_analyze
{
"analyzer": "my_analyzer",
"text": ["赵,钱,孙,李"]
}
# Result
{
"tokens" : [
{
"token" : "吴",
"start_offset" : 0,
"end_offset" : 5,
"type" : "SYNONYM",
"position" : 0
},
{
"token" : "李",
"start_offset" : 6,
"end_offset" : 7,
"type" : "<IDEOGRAPHIC>",
"position" : 1
}
]
}
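Stop-word removal and case conversion follow the same pattern. A minimal sketch using the built-in lowercase and stop filters (the stop filter defaults to the English stop-word list, so AND is dropped after it has been lowercased):
# Lowercase + default English stop words applied directly in _analyze
GET /test_index/_analyze
{
  "tokenizer": "standard",
  "filter": ["lowercase", "stop"],
  "text": ["Teachers AND Students"]
}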
Tokenizer
Splits text into terms.
Example:
GET /test_index/_analyze
{
"tokenizer": "standard",
"text": ["hello word"]
}
# Result
{
"tokens" : [
{
"token" : "hello",
"start_offset" : 0,
"end_offset" : 5,
"type" : "<ALPHANUM>",
"position" : 0
},
{
"token" : "word",
"start_offset" : 6,
"end_offset" : 10,
"type" : "<ALPHANUM>",
"position" : 1
}
]
}
Common tokenizers and analyzers
- standard analyzer: the default analyzer; not ideal for Chinese, since it splits Chinese text character by character.
- pattern tokenizer: splits text into terms using a regular expression that matches the separators.
- simple pattern tokenizer: matches the terms themselves with a regular expression; faster than the pattern tokenizer.
- whitespace analyzer: splits on whitespace only, so a term like Tim_cookie is kept intact (see the example below).
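For example, the whitespace analyzer keeps Tim_cookie as one token and preserves case (a minimal illustration):
GET _analyze
{
  "analyzer": "whitespace",
  "text": "Tim_cookie is a GOOD man"
}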
Custom analyzer
- char_filter: built-in or custom character filters.
- filter: built-in or custom token filters.
- tokenizer: a built-in or custom tokenizer.
Example
PUT /custom_anlysis
{
"settings": {
"analysis": {
"char_filter": {
"my_char_filter": {
"type": "mapping",
"mappings": [
"& => *"
]
}
},
"filter": {
"my_stopword": {
"type": "stop",
"stopwords": [
"is",
"and"
]
}
},
"tokenizer": {
"my_tokenizer": {
"type": "pattern",
"pattern": ",."
}
},
"analyzer": {
"my_analyzer": {
"type": "custom",
"char_filter": [
"my_char_filter"
],
"filter": [
"my_stopword"
],
"tokenizer": "my_tokenizer"
}
}
}
}
}
GET /custom_anlysis/_analyze
{
"analyzer": "my_analyzer",
"text": ["what & is ,good man."]
}
# Result
{
"tokens" : [
{
"token" : "what * is ",
"start_offset" : 0,
"end_offset" : 10,
"type" : "word",
"position" : 0
},
{
"token" : "ood man.",
"start_offset" : 12,
"end_offset" : 20,
"type" : "word",
"position" : 1
}
]
}
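To use the custom analyzer at index time, it would normally be attached to a text field in the mapping. A sketch, where the field name content is only a placeholder:
# The field name "content" is hypothetical
PUT /custom_anlysis/_mapping
{
  "properties": {
    "content": {
      "type": "text",
      "analyzer": "my_analyzer"
    }
  }
}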
Chinese analyzer (IK)
Installation and deployment
- IK download: https://github.com/medcl/elasticsearch-analysis-ik
- GitHub accelerator: https://github.com/fhefh2015/Fast-GitHub
- Create the plugin directory: cd your-es-root/plugins/ && mkdir ik
- Unzip the plugin into your-es-root/plugins/ik
- Restart Elasticsearch
IK file layout
- IKAnalyzer.cfg.xml: the IK configuration file
- Main dictionary: main.dic
- English stop words: stopword.dic; these words are not added to the inverted index
- Special dictionaries:
  - quantifier.dic: units of measure, etc.
  - suffix.dic: administrative-division suffixes
  - surname.dic: Chinese family names (the "Hundred Family Surnames")
  - preposition.dic: modal and function words
- Custom dictionaries: internet slang, trending words, coined words, etc.
The two analyzers provided by IK
- ik_max_word: splits the text at the finest granularity. For example, "中华人民共和国国歌" is split into "中华人民共和国, 中华人民, 中华, 华人, 人民共和国, 人民, 人, 民, 共和国, 共和, 和, 国国, 国歌", exhausting every possible combination. Suitable for term queries.
- ik_smart: splits at the coarsest granularity. For example, "中华人民共和国国歌" is split into "中华人民共和国, 国歌". Suitable for phrase queries (see the comparison request below).
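To compare the two (assuming the IK plugin is installed), run the same text through both analyzers; the token lists should match the splits described above:
# Coarsest-grained split
GET _analyze
{
  "analyzer": "ik_smart",
  "text": "中华人民共和国国歌"
}

# Finest-grained split
GET _analyze
{
  "analyzer": "ik_max_word",
  "text": "中华人民共和国国歌"
}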
Hot updates
Remote dictionary file (a configuration sketch follows the list)
- Pros: easy to get started
- Cons:
  - Managing the dictionary is inconvenient: you edit files on disk directly, and looking words up is cumbersome
  - File reads and writes are not specially optimized, so performance is poor
  - Adds an extra layer of interface calls and network transfers
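A sketch of the remote-dictionary approach: in IKAnalyzer.cfg.xml, point remote_ext_dict and remote_ext_stopwords at an HTTP endpoint that returns one word per line (the URLs below are placeholders); per the IK README, the plugin polls the endpoint and reloads when its Last-Modified or ETag response header changes.
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">
<properties>
    <comment>IK Analyzer extended configuration</comment>
    <!-- local extension dictionaries (left empty here) -->
    <entry key="ext_dict"></entry>
    <entry key="ext_stopwords"></entry>
    <!-- remote dictionaries: placeholder URLs, one word per line -->
    <entry key="remote_ext_dict">http://example.com/hot_words.txt</entry>
    <entry key="remote_ext_stopwords">http://example.com/hot_stopwords.txt</entry>
</properties>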
IK reading from a database (requires modifying the IK source code to point it at extension-word tables in MySQL)
- MySQL driver version compatibility
  - https://dev.mysql.com/doc/connector-j/8.0/en/connector-j-versions.html
  - https://dev.mysql.com/doc/connector-j/5.1/en/connector-j-versions.html
- Driver download
  - https://mvnrepository.com/artifact/mysql/mysql-connector-java