Elasticsearch Core Series (9): Chinese Analyzer (IK)

TianXinCoord

This is an original article by the author; please credit the source when reposting.

Article link: https://blog.csdn.net/sinat_34104446/article/details/119851759

Contents

1. Built-in Analyzers
2. IK Analyzer

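1. Built-in Analyzers

You can send a GET request to the _analyze API, specifying the analyzer and the text to analyze.
The standard analyzer splits at the finest granularity: each Chinese character becomes its own token, and Latin letters are lowercased.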

GET _analyze
{
  "analyzer": "standard",
  "text": ["中国人ABC"]
}

Analysis result

{
  "tokens" : [
    {
      "token" : "中",
      "start_offset" : 0,
      "end_offset" : 1,
      "type" : "<IDEOGRAPHIC>",
      "position" : 0
    },
    {
      "token" : "国",
      "start_offset" : 1,
      "end_offset" : 2,
      "type" : "<IDEOGRAPHIC>",
      "position" : 1
    },
    {
      "token" : "人",
      "start_offset" : 2,
      "end_offset" : 3,
      "type" : "<IDEOGRAPHIC>",
      "position" : 2
    },
    {
      "token" : "abc",
      "start_offset" : 3,
      "end_offset" : 6,
      "type" : "<ALPHANUM>",
      "position" : 3
    }
  ]
}
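
To make the shape of this result easier to reason about, here is a minimal Python sketch that imitates it. It is only a toy approximation, not the real standard analyzer (which follows Unicode text segmentation rules): it assumes that each Han character becomes one token and that each run of ASCII letters or digits becomes one lowercased token. The name toy_standard_tokens is made up for this example.

```python
import re

# Toy approximation only (not Elasticsearch code):
# one token per Han character, one token per run of ASCII letters/digits.
TOKEN_RE = re.compile(r"[\u4e00-\u9fff]|[A-Za-z0-9]+")


def toy_standard_tokens(text):
    """Return token dicts with the same offset/position fields as _analyze."""
    tokens = []
    for position, match in enumerate(TOKEN_RE.finditer(text)):
        tokens.append({
            "token": match.group().lower(),  # lower() leaves Han characters unchanged
            "start_offset": match.start(),
            "end_offset": match.end(),
            "position": position,
        })
    return tokens
```

Calling toy_standard_tokens("中国人ABC") yields four tokens with the same token text, offsets, and positions as the response above.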

The keyword analyzer treats the whole input as a single token and does not split it.

GET _analyze
{
  "analyzer": "keyword",
  "text": ["中国人ABC"]
}

Analysis result

{
  "tokens" : [
    {
      "token" : "中国人ABC",
      "start_offset" : 0,
      "end_offset" : 6,
      "type" : "word",
      "position" : 0
    }
  ]
}

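2. IK Analyzer

2.1 About the IK Analyzer

The IK plugin provides two analyzers: ik_smart and ik_max_word
- ik_smart: coarsest-grained split, producing the fewest tokens
- ik_max_word: finest-grained split, producing every word it can find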
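
The difference between the two modes can be sketched with a toy dictionary. This is not IK's actual algorithm: the "smart" variant below swaps in plain forward maximum matching, whereas the real ik_smart uses its own disambiguation logic, and the function names and the four-word dictionary are assumptions made for illustration only.

```python
def toy_max_word(text, dictionary):
    """Emit every dictionary word found in text, plus uncovered single characters."""
    spans = []
    covered = [False] * len(text)
    for start in range(len(text)):
        for end in range(start + 1, len(text) + 1):
            if text[start:end] in dictionary:
                spans.append((start, end))
                for i in range(start, end):
                    covered[i] = True
    # Characters that belong to no dictionary word become single-character tokens
    spans += [(i, i + 1) for i, hit in enumerate(covered) if not hit]
    # Order by start offset, longer spans first
    spans.sort(key=lambda span: (span[0], -(span[1] - span[0])))
    return [text[start:end] for start, end in spans]


def toy_smart(text, dictionary):
    """Forward maximum matching: take the longest dictionary word at each position."""
    tokens = []
    i = 0
    while i < len(text):
        end = i + 1  # fall back to a single character
        for candidate in range(len(text), i, -1):
            if text[i:candidate] in dictionary:
                end = candidate
                break
        tokens.append(text[i:end])
        i = end
    return tokens


# Hypothetical mini dictionary; the real IK dictionaries are far larger
WORDS = {"中国人", "中国", "国人", "坐标"}
```

With this dictionary, toy_smart("我是中国人码坐标", WORDS) keeps 中国人 as one token, while toy_max_word additionally emits the overlapping words 中国 and 国人.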

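2.2 Installing the IK Analyzer

Download: https://github.com/medcl/elasticsearch-analysis-ik/releases
Note: the plugin version must match the Elasticsearch version exactly
Unzip the IK plugin into es/plugins, in a folder named ik

Restart Elasticsearch to finish the installation; the startup log shows the plugin being loaded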
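
Because a version mismatch is the most common installation problem, a small pre-flight check can help. The sketch below assumes the release archive is named like elasticsearch-analysis-ik-7.14.0.zip; the version number 7.14.0 and both function names are illustrative assumptions, not something stated in this article.

```python
import re


def plugin_version(zip_name):
    """Extract X.Y.Z from a name like elasticsearch-analysis-ik-X.Y.Z.zip, else None."""
    match = re.search(r"elasticsearch-analysis-ik-(\d+\.\d+\.\d+)\.zip$", zip_name)
    return match.group(1) if match else None


def versions_match(es_version, zip_name):
    """True only when the archive's version equals the Elasticsearch version."""
    return plugin_version(zip_name) == es_version
```

For example, versions_match("7.14.0", "elasticsearch-analysis-ik-7.14.0.zip") is True, while the same archive checked against "7.14.1" is False.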

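2.3 Using the IK Analyzer

You can test an analyzer with the _analyze API; replace the placeholder below with ik_smart or ik_max_word

GET _analyze
{
  "analyzer": "<analyzer name>",
  "text": "我是中国人码坐标"
}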
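
Outside of the Kibana console, the same request can be sent from a script. The following is a minimal sketch using only the Python standard library. It assumes an unsecured local node at http://localhost:9200 (an assumption, adjust as needed) and uses POST, which the _analyze API accepts as well as GET. The helper names are made up for this example.

```python
import json
import urllib.request


def build_analyze_request(base_url, analyzer, text):
    """Build (but do not send) a POST request for the _analyze API."""
    body = json.dumps({"analyzer": analyzer, "text": text}).encode("utf-8")
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/_analyze",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def analyze(base_url, analyzer, text):
    """Send the request and return just the token strings."""
    request = build_analyze_request(base_url, analyzer, text)
    with urllib.request.urlopen(request) as response:
        payload = json.load(response)
    return [token["token"] for token in payload["tokens"]]


# Usage (needs a running node with the IK plugin installed):
#   analyze("http://localhost:9200", "ik_smart", "我是中国人码坐标")
```

Splitting request construction from sending keeps the first function easy to inspect without a running cluster.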

2.3.1 ik_smart

Coarsest-grained split

GET _analyze 
{
  "analyzer": "ik_smart",
  "text": "我是中国人码坐标"
}

Result

{
  "tokens" : [
    {
      "token" : "我",
      "start_offset" : 0,
      "end_offset" : 1,
      "type" : "CN_CHAR",
      "position" : 0
    },
    {
      "token" : "是",
      "start_offset" : 1,
      "end_offset" : 2,
      "type" : "CN_CHAR",
      "position" : 1
    },
    {
      "token" : "中国人",
      "start_offset" : 2,
      "end_offset" : 5,
      "type" : "CN_WORD",
      "position" : 2
    },
    {
      "token" : "码",
      "start_offset" : 5,
      "end_offset" : 6,
      "type" : "CN_CHAR",
      "position" : 3
    },
    {
      "token" : "坐标",
      "start_offset" : 6,
      "end_offset" : 8,
      "type" : "CN_WORD",
      "position" : 4
    }
  ]
}

2.3.2 ik_max_word

Finest-grained split

GET _analyze 
{
  "analyzer": "ik_max_word",
  "text": "我是中国人码坐标"
}

Result

{
  "tokens" : [
    {
      "token" : "我",
      "start_offset" : 0,
      "end_offset" : 1,
      "type" : "CN_CHAR",
      "position" : 0
    },
    {
      "token" : "是",
      "start_offset" : 1,
      "end_offset" : 2,
      "type" : "CN_CHAR",
      "position" : 1
    },
    {
      "token" : "中国人",
      "start_offset" : 2,
      "end_offset" : 5,
      "type" : "CN_WORD",
      "position" : 2
    },
    {
      "token" : "中国",
      "start_offset" : 2,
      "end_offset" : 4,
      "type" : "CN_WORD",
      "position" : 3
    },
    {
      "token" : "国人",
      "start_offset" : 3,
      "end_offset" : 5,
      "type" : "CN_WORD",
      "position" : 4
    },
    {
      "token" : "码",
      "start_offset" : 5,
      "end_offset" : 6,
      "type" : "CN_CHAR",
      "position" : 5
    },
    {
      "token" : "坐标",
      "start_offset" : 6,
      "end_offset" : 8,
      "type" : "CN_WORD",
      "position" : 6
    }
  ]
}

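2.4 Custom Dictionary

Create a new .dic file under elasticsearch/plugins/ik/config, for example codecoord.dic here
Edit codecoord.dic and add your words, one per line; the analyzer treats each entry as a single word and does not split it

Edit ik/config/IKAnalyzer.cfg.xml and add the newly created codecoord.dic to the ext_dict entry; separate multiple dictionaries with semicolons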
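
The two steps above can be sketched in Python. This is only an illustration: it writes a UTF-8 dictionary file with one word per line and builds the text of the ext_dict entry, joining multiple files with semicolons as IK's configuration file expects. The function names and the second file name other.dic are hypothetical, and the sketch does not edit IKAnalyzer.cfg.xml for you.

```python
from pathlib import Path


def write_dictionary(path, words):
    """Write one word per line, UTF-8 encoded."""
    Path(path).write_text("\n".join(words) + "\n", encoding="utf-8")


def ext_dict_entry(dictionary_files):
    """Build the ext_dict entry; multiple files are joined with ';'."""
    return '<entry key="ext_dict">' + ";".join(dictionary_files) + "</entry>"


# Usage (paths are assumptions, adjust to your installation):
#   write_dictionary("elasticsearch/plugins/ik/config/codecoord.dic", ["码坐标"])
#   ext_dict_entry(["codecoord.dic"])
```

Paste the resulting entry into IKAnalyzer.cfg.xml in place of the existing ext_dict entry.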

After restarting Elasticsearch, running the analyzer again returns the custom entry (here, 码坐标) as a single token

{
  "tokens" : [
    {
      "token" : "我",
      "start_offset" : 0,
      "end_offset" : 1,
      "type" : "CN_CHAR",
      "position" : 0
    },
    {
      "token" : "是",
      "start_offset" : 1,
      "end_offset" : 2,
      "type" : "CN_CHAR",
      "position" : 1
    },
    {
      "token" : "中国人",
      "start_offset" : 2,
      "end_offset" : 5,
      "type" : "CN_WORD",
      "position" : 2
    },
    {
      "token" : "码坐标",
      "start_offset" : 5,
      "end_offset" : 8,
      "type" : "CN_CHAR",
      "position" : 3
    },
    {
      "token" : "坐标",
      "start_offset" : 6,
      "end_offset" : 8,
      "type" : "CN_WORD",
      "position" : 4
    }
  ]
}

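2.5 Querying with the IK Analyzer

Create an index and specify the analyzers: ik_max_word at index time and ik_smart at search time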

PUT index
{
	"mappings": {
		"properties": {
			"content": {
				"type": "text",
				"analyzer": "ik_max_word",
				"search_analyzer": "ik_smart"
			}
		}
	}
}

Create documents

POST index/_doc/1
{
	"content": "美国留给伊拉克的是个烂摊子吗"
}

POST index/_doc/2
{
	"content": "公安部：各地校车将享最高路权"
}

POST index/_doc/3
{
	"content": "中韩渔警冲突调查：韩警平均每天扣1艘中国渔船"
}

POST index/_doc/4
{
	"content": "中国驻洛杉矶领事馆遭亚裔男子枪击 嫌犯已自首"
}

Specify highlighting options when searching

GET index/_search
{
  "query": {
    "match": {
      "content": "中国"
    }
  },
  "highlight": {
    "pre_tags": ["<tag1>", "<tag2>"],
    "post_tags": ["</tag1>", "</tag2>"],
    "fields": {
      "content": {}
    }
  }
}

The highlighted fragments are returned in the highlight field of each hit

{
  "took" : 50,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 2,
      "relation" : "eq"
    },
    "max_score" : 0.642793,
    "hits" : [
      {
        "_index" : "index",
        "_type" : "_doc",
        "_id" : "3",
        "_score" : 0.642793,
        "_source" : {
          "content" : "中韩渔警冲突调查：韩警平均每天扣1艘中国渔船"
        },
        "highlight" : {
          "content" : [
            "中韩渔警冲突调查：韩警平均每天扣1艘<tag1>中国</tag1>渔船"
          ]
        }
      },
      {
        "_index" : "index",
        "_type" : "_doc",
        "_id" : "4",
        "_score" : 0.642793,
        "_source" : {
          "content" : "中国驻洛杉矶领事馆遭亚裔男子枪击 嫌犯已自首"
        },
        "highlight" : {
          "content" : [
            "<tag1>中国</tag1>驻洛杉矶领事馆遭亚裔男子枪击 嫌犯已自首"
          ]
        }
      }
    ]
  }
}
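
If you consume this response in code, you usually only need the highlighted fragments. Here is a minimal sketch that pulls them out of a response already parsed into a Python dict. The function name highlighted_fragments is an assumption for this example; the field names (hits, highlight, _id) are the ones shown in the response above.

```python
def highlighted_fragments(response, field):
    """Map each hit's _id to its highlighted fragments for one field."""
    result = {}
    for hit in response.get("hits", {}).get("hits", []):
        fragments = hit.get("highlight", {}).get(field, [])
        if fragments:  # skip hits without highlights for this field
            result[hit["_id"]] = fragments
    return result
```

Applied to the response above with field "content", it returns one fragment list each for documents 3 and 4.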
