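Key point 1: Installing analyzer plugins
How to check which plugins Elasticsearch already has installed
Open http://<es-host>:9200/_cat/plugins in a browser, replacing <es-host> with the IP address of your Elasticsearch node (9200 is the default HTTP port; adjust it if yours differs).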
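The plugin check can also be scripted. The sketch below parses the plain-text body returned by _cat/plugins, assuming the default three columns (node name, component, version). The sample text, node name, and version numbers are made up for illustration; in practice the text would come from an HTTP GET against your own cluster.

```python
def parse_cat_plugins(text):
    """Parse the plain-text body of GET /_cat/plugins into a list of dicts.

    Assumes the default column layout: node name, component, version.
    """
    plugins = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) < 3:
            continue  # skip blank or malformed lines
        plugins.append(
            {"node": parts[0], "component": parts[1], "version": parts[2]}
        )
    return plugins


def has_plugin(plugins, component):
    """Return True if any node reports the given plugin component."""
    return any(p["component"] == component for p in plugins)


# Hypothetical sample output; the node name and versions are invented.
sample = """\
node-1 analysis-icu 7.6.0
node-1 analysis-ik  7.6.0
"""

plugins = parse_cat_plugins(sample)
print(has_plugin(plugins, "analysis-icu"))  # True
```

If both analysis-icu and analysis-ik show up for every node, the analyzers used in the comparison later in these notes should be available.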
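To install an analyzer plugin: download the release that matches your Elasticsearch version, unzip it into the plugins directory, and restart Elasticsearch.
analysis-icu analyzer plugin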
https://www.elastic.co/guide/en/elasticsearch/plugins/current/analysis-icu.html
IK analyzer plugin
https://github.com/medcl/elasticsearch-analysis-ik/releases
https://www.cnblogs.com/dgwblog/p/12374212.html
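Key point 2: Tokenizing text with an Analyzer
Analysis: text analysis is the process of converting full text into a series of terms (tokens); it is also called tokenization. In Elasticsearch, analysis is performed by an analyzer: you can use one of the built-in analyzers or define a custom one as needed.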
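An Analyzer consists of three parts
• Character Filters: pre-process the raw text, e.g. strip HTML
• Tokenizer: split the text into tokens according to a set of rules
• Token Filters: post-process the tokens, e.g. lowercase them, remove stopwords, add synonyms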
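The three stages can be illustrated with a toy pipeline in plain Python. This is only a sketch of the idea, not how Elasticsearch implements it: the regular expressions, the function names, and the five-word stopword list are assumptions made for the example.

```python
import re

# Tiny illustrative stopword list; real analyzers ship longer ones.
STOPWORDS = {"the", "is", "a", "in", "to"}


def strip_html(text):
    """Character filter: replace anything that looks like an HTML tag with a space."""
    return re.sub(r"<[^>]+>", " ", text)


def tokenize(text):
    """Tokenizer: split the text into runs of word characters."""
    return re.findall(r"\w+", text)


def lowercase(tokens):
    """Token filter: lowercase every token."""
    return [t.lower() for t in tokens]


def remove_stopwords(tokens):
    """Token filter: drop tokens that appear in the stopword list."""
    return [t for t in tokens if t not in STOPWORDS]


def analyze(text):
    """Run character filter, then tokenizer, then token filters, in that order."""
    text = strip_html(text)
    tokens = tokenize(text)
    for token_filter in (lowercase, remove_stopwords):
        tokens = token_filter(tokens)
    return tokens


print(analyze("<p>The Quick fox is in the <b>Box</b></p>"))
# ['quick', 'fox', 'box']
```

The order matters: lowercasing runs before stopword removal, so "The" is normalized to "the" and then dropped.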
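Analyzer API
The _analyze API offers three ways to see how an Analyzer processes text
• Test a named Analyzer directly
• Test against a field of an index, using the analyzer mapped to that field
• Test a custom combination of tokenizer and filters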
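The three request shapes can be sketched as small Python helpers that only build the path and JSON body; nothing is sent over the network. The index name books and the field name title are hypothetical, while standard, lowercase, and html_strip are names of components that ship with Elasticsearch.

```python
import json


def analyze_with_analyzer(analyzer, text):
    """Test a named analyzer directly: GET /_analyze."""
    return "/_analyze", {"analyzer": analyzer, "text": text}


def analyze_with_field(index, field, text):
    """Use the analyzer mapped to a field: GET /<index>/_analyze."""
    return f"/{index}/_analyze", {"field": field, "text": text}


def analyze_with_custom(tokenizer, filters, text, char_filters=()):
    """Combine a tokenizer with token filters (and optional character filters) ad hoc."""
    body = {"tokenizer": tokenizer, "filter": list(filters), "text": text}
    if char_filters:
        body["char_filter"] = list(char_filters)
    return "/_analyze", body


examples = [
    analyze_with_analyzer("standard", "Mastering Elasticsearch"),
    # "books" and "title" are hypothetical index and field names.
    analyze_with_field("books", "title", "Mastering Elasticsearch"),
    analyze_with_custom(
        "standard", ["lowercase"], "<b>Mastering</b> Elasticsearch",
        char_filters=["html_strip"],
    ),
]

for path, body in examples:
    print("GET", path)
    print(json.dumps(body, ensure_ascii=False, indent=2))
```

Each printed path and body pair can be pasted into the Kibana console as is.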
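Elasticsearch built-in analyzers
Stop Analyzer: Simple Analyzer plus stopword filtering (common function words such as the, is, a, in, to)
Language analyzers: tokenize according to the characteristics of a specific language. For English, for example, this amounts to the Stop Analyzer plus word normalization (e.g. reducing plurals to singular forms)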
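The "Simple Analyzer plus stopword filtering" relationship can be sketched in a few lines of Python. This only approximates the built-in analyzers: the letter-matching regular expression and the five-word stopword list (taken from the examples above) are assumptions, and the real default stopword list is longer.

```python
import re

# Stopwords taken from the examples in these notes; the real default list is longer.
STOPWORDS = {"the", "is", "a", "in", "to"}


def simple_analyzer(text):
    """Approximate the Simple Analyzer: split on non-letters, then lowercase."""
    return [t.lower() for t in re.findall(r"[^\W\d_]+", text)]


def stop_analyzer(text):
    """Approximate the Stop Analyzer: Simple Analyzer output minus stopwords."""
    return [t for t in simple_analyzer(text) if t not in STOPWORDS]


text = "The 2 QUICK Brown-Foxes jumped in to the bin"
print(simple_analyzer(text))
# ['the', 'quick', 'brown', 'foxes', 'jumped', 'in', 'to', 'the', 'bin']
print(stop_analyzer(text))
# ['quick', 'brown', 'foxes', 'jumped', 'bin']
```

Note that the digit 2 disappears in both outputs because the splitting rule keeps only letters.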
IK analyzer vs. the official ICU analyzer
GET /_analyze
{
  "analyzer": "icu_analyzer",
  "text": "长风破浪会有时,直挂云帆济沧海"
}
Response:
{
  "tokens" : [
    { "token" : "长风破浪", "start_offset" : 0, "end_offset" : 4, "type" : "<IDEOGRAPHIC>", "position" : 0 },
    { "token" : "会", "start_offset" : 4, "end_offset" : 5, "type" : "<IDEOGRAPHIC>", "position" : 1 },
    { "token" : "有时", "start_offset" : 5, "end_offset" : 7, "type" : "<IDEOGRAPHIC>", "position" : 2 },
    { "token" : "直", "start_offset" : 8, "end_offset" : 9, "type" : "<IDEOGRAPHIC>", "position" : 3 },
    { "token" : "挂", "start_offset" : 9, "end_offset" : 10, "type" : "<IDEOGRAPHIC>", "position" : 4 },
    { "token" : "云", "start_offset" : 10, "end_offset" : 11, "type" : "<IDEOGRAPHIC>", "position" : 5 },
    { "token" : "帆", "start_offset" : 11, "end_offset" : 12, "type" : "<IDEOGRAPHIC>", "position" : 6 },
    { "token" : "济", "start_offset" : 12, "end_offset" : 13, "type" : "<IDEOGRAPHIC>", "position" : 7 },
    { "token" : "沧海", "start_offset" : 13, "end_offset" : 15, "type" : "<IDEOGRAPHIC>", "position" : 8 }
  ]
}
GET /_analyze
{
  "analyzer": "ik_max_word",
  "text": "长风破浪会有时,直挂云帆济沧海"
}
Response:
{
  "tokens" : [
    { "token" : "长风破浪", "start_offset" : 0, "end_offset" : 4, "type" : "CN_WORD", "position" : 0 },
    { "token" : "长风", "start_offset" : 0, "end_offset" : 2, "type" : "CN_WORD", "position" : 1 },
    { "token" : "破浪", "start_offset" : 2, "end_offset" : 4, "type" : "CN_WORD", "position" : 2 },
    { "token" : "会有", "start_offset" : 4, "end_offset" : 6, "type" : "CN_WORD", "position" : 3 },
    { "token" : "有时", "start_offset" : 5, "end_offset" : 7, "type" : "CN_WORD", "position" : 4 },
    { "token" : "直", "start_offset" : 8, "end_offset" : 9, "type" : "CN_CHAR", "position" : 5 },
    { "token" : "挂", "start_offset" : 9, "end_offset" : 10, "type" : "CN_CHAR", "position" : 6 },
    { "token" : "云", "start_offset" : 10, "end_offset" : 11, "type" : "CN_CHAR", "position" : 7 },
    { "token" : "帆", "start_offset" : 11, "end_offset" : 12, "type" : "CN_CHAR", "position" : 8 },
    { "token" : "济", "start_offset" : 12, "end_offset" : 13, "type" : "CN_CHAR", "position" : 9 },
    { "token" : "沧海", "start_offset" : 13, "end_offset" : 15, "type" : "CN_WORD", "position" : 10 }
  ]
}
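To make the difference between the two results easier to see, the sketch below compares them in plain Python. The (token, start_offset, end_offset) tuples are copied from the icu_analyzer and ik_max_word responses in this section; the helper names are made up for the example.

```python
# (token, start_offset, end_offset) copied from the icu_analyzer response.
icu = [
    ("长风破浪", 0, 4), ("会", 4, 5), ("有时", 5, 7), ("直", 8, 9), ("挂", 9, 10),
    ("云", 10, 11), ("帆", 11, 12), ("济", 12, 13), ("沧海", 13, 15),
]

# (token, start_offset, end_offset) copied from the ik_max_word response.
ik = [
    ("长风破浪", 0, 4), ("长风", 0, 2), ("破浪", 2, 4), ("会有", 4, 6), ("有时", 5, 7),
    ("直", 8, 9), ("挂", 9, 10), ("云", 10, 11), ("帆", 11, 12), ("济", 12, 13),
    ("沧海", 13, 15),
]


def only_in(a, b):
    """Tokens that appear in result a but not in result b, in original order."""
    seen = {token for token, _, _ in b}
    return [token for token, _, _ in a if token not in seen]


def overlapping(tokens):
    """Tokens that start before the furthest end offset already covered."""
    result = []
    furthest_end = 0
    for token, start, end in tokens:
        if start < furthest_end:
            result.append(token)
        furthest_end = max(furthest_end, end)
    return result


print(only_in(ik, icu))   # ['长风', '破浪', '会有']
print(only_in(icu, ik))   # ['会']
print(overlapping(icu))   # []
print(overlapping(ik))    # ['长风', '破浪', '有时']
```

In this example icu_analyzer covers each character exactly once, while ik_max_word also emits shorter words nested inside or overlapping longer ones, which is why it returns 11 tokens instead of 9.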