I. Plugin and Elasticsearch version compatibility
plugin | elasticsearch |
---|---|
7.6.2 | 7.6.2 |
7.7.0 | 7.7.0 |
7.7.1 | 7.7.1 |
7.8.0 | 7.8.0 |
7.8.1 | 7.8.1 |
7.9.0 | 7.9.0 |
7.9.1 | 7.9.1 |
7.9.2 | 7.9.2 |
7.9.3 | 7.9.3 |
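II. Installation
1. Download and install the plugin release that matches your Elasticsearch version
a. GitHub - NLPchina/elasticsearch-analysis-ansj
b. Unzip the release archive (for example, elasticsearch-analysis-ansj-7.7.1-release.zip) into the plugins directory
c. Copy ansj.cfg.xml into the Elasticsearch config directory
d. Create a library directory at the same level as the Elasticsearch config directory to hold the segmentation data, and place the dictionary files in it:
custom dictionary (default.dic), stop-word dictionary (stop.dic), ambiguity dictionary (ambiguity.dic), synonym dictionary (synonyms.dic)
2. Restart Elasticsearch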
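Before restarting, it can help to confirm that the files named in the steps above are where the plugin is expected to look for them. The following is a minimal sketch, not part of the plugin: the helper name `missing_ansj_files` is hypothetical, and the layout (a `config` and a `library` directory side by side under one Elasticsearch home directory) is an assumption drawn from the steps above.

```python
from pathlib import Path

# Dictionary file names as listed in the installation steps.
DICTIONARY_FILES = ("default.dic", "stop.dic", "ambiguity.dic", "synonyms.dic")


def missing_ansj_files(es_home):
    """Return the expected ansj files that are absent under es_home.

    Assumed layout (hypothetical, based on the steps above):
      <es_home>/config/ansj.cfg.xml
      <es_home>/library/<dictionary files>
    """
    home = Path(es_home)
    expected = [home / "config" / "ansj.cfg.xml"]
    expected += [home / "library" / name for name in DICTIONARY_FILES]
    # Report paths relative to es_home so the result is easy to read.
    return [p.relative_to(home).as_posix() for p in expected if not p.is_file()]
```

Calling `missing_ansj_files("/path/to/elasticsearch")` returns an empty list when every file is present. Whether all four dictionaries are actually required depends on your ansj.cfg.xml, so treat a non-empty result as a prompt to check, not as an error.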
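III. Analyzers
1. Analyzer types
Analyzer | Description |
---|---|
base_ansj | Basic segmentation |
index_ansj | Index segmentation; produces the finest-grained split |
query_ansj | Query segmentation |
dic_ansj | User-defined dictionary segmentation |
nlp_ansj | Natural-language segmentation |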
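One plausible way to use these analyzers together is to index a text field with index_ansj and search it with query_ansj. The sketch below builds such an index body in Python. The field name `content` and the helper `ansj_text_field` are hypothetical, and pairing these two analyzers is an assumption based on their names, not a recommendation stated by the plugin. The `analyzer` and `search_analyzer` keys are standard Elasticsearch mapping parameters.

```python
import json


def ansj_text_field(index_analyzer="index_ansj", search_analyzer="query_ansj"):
    """Build a text-field mapping that indexes and searches with ansj analyzers."""
    return {
        "type": "text",
        "analyzer": index_analyzer,          # applied when documents are indexed
        "search_analyzer": search_analyzer,  # applied to query strings
    }


# "content" is a hypothetical field name used only for illustration.
index_body = {"mappings": {"properties": {"content": ansj_text_field()}}}
print(json.dumps(index_body, indent=2))
```

The printed JSON can be sent as the body of an index-creation request once the plugin is installed.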
2. Example
POST _analyze
{
"text": ["美国阿拉斯加州发生8.0级地震"],
"analyzer": "index_ansj"
}
Result
{
"tokens" : [
{
"token" : "美国",
"start_offset" : 0,
"end_offset" : 2,
"type" : "ns",
"position" : 0
},
{
"token" : "美",
"start_offset" : 0,
"end_offset" : 1,
"type" : "b",
"position" : 1
},
{
"token" : "国",
"start_offset" : 1,
"end_offset" : 2,
"type" : "n",
"position" : 2
},
{
"token" : "阿拉斯加州",
"start_offset" : 2,
"end_offset" : 7,
"type" : "nsf",
"position" : 3
},
{
"token" : "阿拉斯加",
"start_offset" : 2,
"end_offset" : 6,
"type" : "nsf",
"position" : 4
},
{
"token" : "阿拉斯",
"start_offset" : 2,
"end_offset" : 5,
"type" : "nsf",
"position" : 5
},
{
"token" : "阿拉",
"start_offset" : 2,
"end_offset" : 4,
"type" : "r",
"position" : 6
},
{
"token" : "阿",
"start_offset" : 2,
"end_offset" : 3,
"type" : "b",
"position" : 7
},
{
"token" : "拉斯",
"start_offset" : 3,
"end_offset" : 5,
"type" : "nrf",
"position" : 8
},
{
"token" : "拉",
"start_offset" : 3,
"end_offset" : 4,
"type" : "v",
"position" : 9
},
{
"token" : "斯",
"start_offset" : 4,
"end_offset" : 5,
"type" : "b",
"position" : 10
},
{
"token" : "加州",
"start_offset" : 5,
"end_offset" : 7,
"type" : "ns",
"position" : 11
},
{
"token" : "加",
"start_offset" : 5,
"end_offset" : 6,
"type" : "v",
"position" : 12
},
{
"token" : "州",
"start_offset" : 6,
"end_offset" : 7,
"type" : "n",
"position" : 13
},
{
"token" : "发生",
"start_offset" : 7,
"end_offset" : 9,
"type" : "v",
"position" : 14
},
{
"token" : "发",
"start_offset" : 7,
"end_offset" : 8,
"type" : "v",
"position" : 15
},
{
"token" : "生",
"start_offset" : 8,
"end_offset" : 9,
"type" : "v",
"position" : 16
},
{
"token" : "8.0级",
"start_offset" : 9,
"end_offset" : 13,
"type" : "mq",
"position" : 17
},
{
"token" : "0",
"start_offset" : 11,
"end_offset" : 12,
"type" : "w",
"position" : 18
},
{
"token" : "级",
"start_offset" : 12,
"end_offset" : 13,
"type" : "q",
"position" : 19
},
{
"token" : "地震",
"start_offset" : 13,
"end_offset" : 15,
"type" : "n",
"position" : 20
},
{
"token" : "地",
"start_offset" : 13,
"end_offset" : 14,
"type" : "ude2",
"position" : 21
},
{
"token" : "震",
"start_offset" : 14,
"end_offset" : 15,
"type" : "vi",
"position" : 22
}
]
}
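The same request can be issued from a script. The sketch below uses only the Python standard library and assumes Elasticsearch is reachable at http://127.0.0.1:9200 without authentication. The helper names `build_analyze_request`, `token_spans`, and `analyze` are hypothetical.

```python
import json
import urllib.request


def build_analyze_request(text, analyzer="index_ansj",
                          base_url="http://127.0.0.1:9200"):
    """Build (but do not send) a POST _analyze request for the given text."""
    body = json.dumps({"text": [text], "analyzer": analyzer}).encode("utf-8")
    return urllib.request.Request(
        base_url.rstrip("/") + "/_analyze",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def token_spans(response):
    """Reduce an _analyze response to (token, start_offset, end_offset) tuples."""
    return [(t["token"], t["start_offset"], t["end_offset"])
            for t in response["tokens"]]


def analyze(text, analyzer="index_ansj", base_url="http://127.0.0.1:9200"):
    """Send the request and return the token spans (needs a running node)."""
    request = build_analyze_request(text, analyzer, base_url)
    with urllib.request.urlopen(request) as resp:
        return token_spans(json.load(resp))
```

With a node running and the plugin installed, `analyze("美国阿拉斯加州发生8.0级地震")` should yield spans corresponding to the tokens listed in the result above.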
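IV. APIs exposed by ansj
Endpoint | Description |
---|---|
/_cat/ansj | Run segmentation |
/_cat/ansj/config | Show all configuration |
/_ansj/flush/config | Refresh all configuration |
/_ansj/flush/config/single | Refresh a single configuration |
/_ansj/flush/dic | Update all dictionaries |
/_ansj/flush/dic/single | Update a single dictionary |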
http://127.0.0.1:9200/_ansj/flush/dic/single?key=dic
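As a small illustration, the single-dictionary flush URL can be assembled programmatically so that the key is always properly encoded. The helper name `flush_dic_single_url` is hypothetical, and the sketch only builds the URL. It does not send anything, since the HTTP method the endpoint expects is not stated here.

```python
from urllib.parse import urlencode


def flush_dic_single_url(key, base_url="http://127.0.0.1:9200"):
    """Build the URL that asks ansj to reload the dictionary named by key."""
    return base_url.rstrip("/") + "/_ansj/flush/dic/single?" + urlencode({"key": key})


print(flush_dic_single_url("dic"))
# prints: http://127.0.0.1:9200/_ansj/flush/dic/single?key=dic
```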
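/_cat/ansj: run segmentation
Example: /_cat/ansj?text=中国&type=index_ansj&dic=dic&stop=stop&ambiguity=ambiguity&synonyms=synonyms
The text and type parameters are required: text is the sentence to segment, and type is the analyzer type, which supports the following values: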