The default Elasticsearch analyzer handles Chinese poorly: a sentence may be split into individual characters rather than words. The IK analyzer segments Chinese text intelligently and also supports custom dictionaries.
I. Installing and testing the IK analyzer
1. Download: download link (using a download manager such as Thunder is recommended)
2. Unzip the archive and copy its contents into a new folder named ik inside the Elasticsearch plugins directory, e.g.
D:\elasticsearch-7.5.0-windows-x86_64\elasticsearch-7.5.0\plugins
3. Test. Restart Elasticsearch first, then run _analyze.
Before installing IK (the default standard analyzer):
POST _analyze
{
  "analyzer": "standard",
  "text": "我喜欢的那个人呢"
}
{
  "tokens" : [
    { "token" : "我", "start_offset" : 0, "end_offset" : 1, "type" : "<IDEOGRAPHIC>", "position" : 0 },
    { "token" : "喜", "start_offset" : 1, "end_offset" : 2, "type" : "<IDEOGRAPHIC>", "position" : 1 },
    { "token" : "欢", "start_offset" : 2, "end_offset" : 3, "type" : "<IDEOGRAPHIC>", "position" : 2 },
    { "token" : "的", "start_offset" : 3, "end_offset" : 4, "type" : "<IDEOGRAPHIC>", "position" : 3 },
    { "token" : "那", "start_offset" : 4, "end_offset" : 5, "type" : "<IDEOGRAPHIC>", "position" : 4 },
    { "token" : "个", "start_offset" : 5, "end_offset" : 6, "type" : "<IDEOGRAPHIC>", "position" : 5 },
    { "token" : "人", "start_offset" : 6, "end_offset" : 7, "type" : "<IDEOGRAPHIC>", "position" : 6 },
    { "token" : "呢", "start_offset" : 7, "end_offset" : 8, "type" : "<IDEOGRAPHIC>", "position" : 7 }
  ]
}
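Reading the offsets makes the problem concrete: every token spans exactly one character, so word-level Chinese search on such a field cannot work well. A minimal Python check over the token list copied from the response above:

```python
# (token, start_offset, end_offset) triples copied from the standard-analyzer response.
tokens = [
    ("我", 0, 1), ("喜", 1, 2), ("欢", 2, 3), ("的", 3, 4),
    ("那", 4, 5), ("个", 5, 6), ("人", 6, 7), ("呢", 7, 8),
]

# Every token is a single character: no Chinese word was recognized.
print(all(end - start == 1 for _, start, end in tokens))  # True
print("".join(tok for tok, _, _ in tokens))               # 我喜欢的那个人呢
```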
After installing IK, first mode (ik_smart):
POST _analyze
{
  "analyzer": "ik_smart",
  "text": "我喜欢的那个人呢"
}
{
  "tokens" : [
    { "token" : "我", "start_offset" : 0, "end_offset" : 1, "type" : "CN_CHAR", "position" : 0 },
    { "token" : "喜欢", "start_offset" : 1, "end_offset" : 3, "type" : "CN_WORD", "position" : 1 },
    { "token" : "的", "start_offset" : 3, "end_offset" : 4, "type" : "CN_CHAR", "position" : 2 },
    { "token" : "那个", "start_offset" : 4, "end_offset" : 6, "type" : "CN_WORD", "position" : 3 },
    { "token" : "人呢", "start_offset" : 6, "end_offset" : 8, "type" : "CN_WORD", "position" : 4 }
  ]
}
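With ik_smart the tokens now line up with Chinese words. This mode produces a coarse, non-overlapping segmentation: each token begins exactly where the previous one ended. A minimal Python sketch over the offsets copied from the response above:

```python
# (token, start_offset, end_offset) triples copied from the ik_smart response.
tokens = [
    ("我", 0, 1), ("喜欢", 1, 3), ("的", 3, 4), ("那个", 4, 6), ("人呢", 6, 8),
]

# ik_smart tiles the input with consecutive, non-overlapping tokens:
prev_end = 0
for tok, start, end in tokens:
    assert start == prev_end           # no gaps, no overlaps
    assert end - start == len(tok)     # offsets match the token's length
    prev_end = end

print([tok for tok, _, _ in tokens])  # ['我', '喜欢', '的', '那个', '人呢']
```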
After installing IK, second mode (ik_max_word):
POST _analyze
{
  "analyzer": "ik_max_word",
  "text": "我喜欢的那个人呢"
}
{
  "tokens" : [
    { "token" : "我", "start_offset" : 0, "end_offset" : 1, "type" : "CN_CHAR", "position" : 0 },
    { "token" : "喜欢", "start_offset" : 1, "end_offset" : 3, "type" : "CN_WORD", "position" : 1 },
    { "token" : "的", "start_offset" : 3, "end_offset" : 4, "type" : "CN_CHAR", "position" : 2 },
    { "token" : "那个人", "start_offset" : 4, "end_offset" : 7, "type" : "CN_WORD", "position" : 3 },
    { "token" : "那个", "start_offset" : 4, "end_offset" : 6, "type" : "CN_WORD", "position" : 4 },
    { "token" : "个人", "start_offset" : 5, "end_offset" : 7, "type" : "CN_WORD", "position" : 5 },
    { "token" : "人呢", "start_offset" : 6, "end_offset" : 8, "type" : "CN_WORD", "position" : 6 }
  ]
}
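ik_max_word, by contrast, is exhaustive: it emits every dictionary word it can find, so tokens may overlap (那个人 above contains both 那个 and 个人). A small sketch, again using offsets copied from the response:

```python
# (token, start_offset, end_offset) triples copied from the ik_max_word response.
tokens = [
    ("我", 0, 1), ("喜欢", 1, 3), ("的", 3, 4),
    ("那个人", 4, 7), ("那个", 4, 6), ("个人", 5, 7), ("人呢", 6, 8),
]

# Count token pairs whose character ranges intersect; ik_smart would give 0.
overlaps = [
    (a, b)
    for i, (a, s1, e1) in enumerate(tokens)
    for b, s2, e2 in tokens[i + 1:]
    if s1 < e2 and s2 < e1
]
print(len(overlaps))  # 5 overlapping pairs, e.g. ('那个人', '那个')
```

The overlapping variants give recall a boost at index time, which is why ik_max_word is commonly used for indexing and ik_smart for search queries.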
II. Custom dictionaries for the IK analyzer
1. View the main dictionary:
D:\elasticsearch-7.5.0-windows-x86_64\elasticsearch-7.5.0\plugins\ik\config\main.dic
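main.dic holds the plugin's built-in vocabulary. To add your own words, the usual approach for this plugin (worth double-checking against the files shipped in your own download) is to create a UTF-8 .dic file with one word per line and register it in plugins\ik\config\IKAnalyzer.cfg.xml. The file name custom.dic below is a hypothetical example:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">
<properties>
    <comment>IK Analyzer extension configuration</comment>
    <!-- custom.dic is a hypothetical file name: one word per line, UTF-8 encoded -->
    <entry key="ext_dict">custom.dic</entry>
    <entry key="ext_stopwords"></entry>
</properties>
```

After editing the configuration, restart Elasticsearch so the new dictionary is loaded.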