Before any analysis plugin is installed, only Elasticsearch's built-in analyzers are available:
1) Standard analysis
GET _analyze
{
"text": ["他是一个前端开发工程师"],
"analyzer": "standard"
}
Or:
2) Whole-text (keyword) analysis
GET _analyze
{
"text": ["他是一个前端开发工程师"],
"analyzer": "keyword"
}
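Result with the standard analyzer, which emits one token per Chinese character (the keyword analyzer would instead return the entire input as a single token):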
{
"tokens" : [
{
"token" : "他",
"start_offset" : 0,
"end_offset" : 1,
"type" : "<IDEOGRAPHIC>",
"position" : 0
},
{
"token" : "是",
"start_offset" : 1,
"end_offset" : 2,
"type" : "<IDEOGRAPHIC>",
"position" : 1
},
{
"token" : "一",
"start_offset" : 2,
"end_offset" : 3,
"type" : "<IDEOGRAPHIC>",
"position" : 2
},
{
"token" : "个",
"start_offset" : 3,
"end_offset" : 4,
"type" : "<IDEOGRAPHIC>",
"position" : 3
},
{
"token" : "前",
"start_offset" : 4,
"end_offset" : 5,
"type" : "<IDEOGRAPHIC>",
"position" : 4
},
{
"token" : "端",
"start_offset" : 5,
"end_offset" : 6,
"type" : "<IDEOGRAPHIC>",
"position" : 5
},
{
"token" : "开",
"start_offset" : 6,
"end_offset" : 7,
"type" : "<IDEOGRAPHIC>",
"position" : 6
},
{
"token" : "发",
"start_offset" : 7,
"end_offset" : 8,
"type" : "<IDEOGRAPHIC>",
"position" : 7
},
{
"token" : "工",
"start_offset" : 8,
"end_offset" : 9,
"type" : "<IDEOGRAPHIC>",
"position" : 8
},
{
"token" : "程",
"start_offset" : 9,
"end_offset" : 10,
"type" : "<IDEOGRAPHIC>",
"position" : 9
},
{
"token" : "师",
"start_offset" : 10,
"end_offset" : 11,
"type" : "<IDEOGRAPHIC>",
"position" : 10
}
]
}
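Using the analysis-ik plugin
1. Installation
[elasticsearch@txvm2019 bin]$ ./elasticsearch-plugin install https://github.com/medcl/elasticsearch-analysis-ik/releases/download/v7.2.0/elasticsearch-analysis-ik-7.2.0.zip
After installation, run list to check that the plugin is present (Elasticsearch must be restarted before the new analyzers can be used):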
[elasticsearch@txvm2019 bin]$ ./elasticsearch-plugin list
analysis-icu
analysis-ik
2. Test:
GET _analyze
{
"text": ["他是一个前端开发工程师"],
"analyzer": "ik_max_word"
}
Result:
{
"tokens" : [
{
"token" : "他",
"start_offset" : 0,
"end_offset" : 1,
"type" : "CN_CHAR",
"position" : 0
},
{
"token" : "是",
"start_offset" : 1,
"end_offset" : 2,
"type" : "CN_CHAR",
"position" : 1
},
{
"token" : "一个",
"start_offset" : 2,
"end_offset" : 4,
"type" : "CN_WORD",
"position" : 2
},
{
"token" : "一",
"start_offset" : 2,
"end_offset" : 3,
"type" : "TYPE_CNUM",
"position" : 3
},
{
"token" : "个",
"start_offset" : 3,
"end_offset" : 4,
"type" : "COUNT",
"position" : 4
},
{
"token" : "前端",
"start_offset" : 4,
"end_offset" : 6,
"type" : "CN_WORD",
"position" : 5
},
{
"token" : "开发",
"start_offset" : 6,
"end_offset" : 8,
"type" : "CN_WORD",
"position" : 6
},
{
"token" : "工程师",
"start_offset" : 8,
"end_offset" : 11,
"type" : "CN_WORD",
"position" : 7
},
{
"token" : "工程",
"start_offset" : 8,
"end_offset" : 10,
"type" : "CN_WORD",
"position" : 8
},
{
"token" : "师",
"start_offset" : 10,
"end_offset" : 11,
"type" : "CN_CHAR",
"position" : 9
}
]
}
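3. Notes:
IK provides two analyzers:
ik_max_word: splits the text at the finest granularity, emitting every word combination it finds;
ik_smart: splits the text at the coarsest granularity.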
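To compare the two analyzers, the test request from step 2 can be repeated with ik_smart. This is only a sketch that reuses the same sample sentence; the response is not shown because the exact tokens depend on the dictionary shipped with the installed plugin version.

```
# Same sample sentence as in the ik_max_word test, analyzed at the coarsest granularity
GET _analyze
{
  "text": ["他是一个前端开发工程师"],
  "analyzer": "ik_smart"
}
```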
If no analyzer is specified, Elasticsearch uses the standard analyzer by default, which simply splits Chinese text into individual characters.
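To avoid falling back to standard, the analyzer can be set explicitly on a text field in the index mapping. The following is a minimal sketch, assuming Elasticsearch 7.x with analysis-ik installed. The index name ik_demo and the field name content are hypothetical, and pairing ik_max_word for indexing with ik_smart for searching is one common choice rather than a requirement.

```
# Hypothetical index "ik_demo" with one text field "content"
PUT ik_demo
{
  "mappings": {
    "properties": {
      "content": {
        "type": "text",
        "analyzer": "ik_max_word",
        "search_analyzer": "ik_smart"
      }
    }
  }
}

# Analyze the sample sentence with whatever analyzer the field is mapped to
GET ik_demo/_analyze
{
  "field": "content",
  "text": ["他是一个前端开发工程师"]
}
```

With this mapping in place, documents indexed into content are tokenized by IK without naming an analyzer in each request.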