Section 9: Elasticsearch Analyzers

hi_kong

Category: elasticsearch. Tags: elasticsearch, IK analyzer

Article link: https://blog.csdn.net/hi_kong/article/details/119573768

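Earlier sections installed the analyzer plugin, but I never looked closely at how to actually use it. I spent some time on it today and am writing the results down for future reference.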

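English tokenization

When no analyzer is specified, the _analyze API uses the standard analyzer, which splits English text on word boundaries (in practice, spaces and punctuation) and lowercases each token. The request is: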

POST http://10.140.188.135:9200/_analyze
{
    "text": "hello word"
}

Response:
{
    "tokens": [
        {
            "token": "hello",
            "start_offset": 0,
            "end_offset": 5,
            "type": "<ALPHANUM>",
            "position": 0
        },
        {
            "token": "word",
            "start_offset": 6,
            "end_offset": 10,
            "type": "<ALPHANUM>",
            "position": 1
        }
    ]
}
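
The same request can also be sent from a script. Below is a minimal Python sketch that uses only the standard library. The host address is the one used in this post, and the helper names (build_analyze_body, analyze, token_texts) are made up for illustration, not part of any Elasticsearch client.

```python
import json
import urllib.request

ES_URL = "http://10.140.188.135:9200"  # host used in this post; adjust for your cluster


def build_analyze_body(text, analyzer=None):
    """Build the JSON body for _analyze; leave out "analyzer" to use the default."""
    body = {"text": text}
    if analyzer is not None:
        body["analyzer"] = analyzer
    return body


def analyze(text, analyzer=None, base_url=ES_URL):
    """POST the body to /_analyze and return the decoded JSON response."""
    data = json.dumps(build_analyze_body(text, analyzer)).encode("utf-8")
    request = urllib.request.Request(
        base_url + "/_analyze",
        data=data,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


def token_texts(response):
    """Return just the token strings from an _analyze response."""
    return [item["token"] for item in response["tokens"]]
```

Given the response shown above, token_texts returns the list ['hello', 'word']. Passing analyzer="ik_smart" or analyzer="ik_max_word" to analyze adds the "analyzer" field used later in this post.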

The text does not have to be valid English. Any space-separated strings are split the same way:

POST http://10.140.188.135:9200/_analyze
{
    "text": "nihao word"
}

Response:
{
    "tokens": [
        {
            "token": "nihao",
            "start_offset": 0,
            "end_offset": 5,
            "type": "<ALPHANUM>",
            "position": 0
        },
        {
            "token": "word",
            "start_offset": 6,
            "end_offset": 10,
            "type": "<ALPHANUM>",
            "position": 1
        }
    ]
}

Chinese tokenization

The default analyzer produces the following:

POST http://10.140.188.135:9200/_analyze
{
    "text": "生活如此美丽"
}

Response:
{
    "tokens": [
        {
            "token": "生",
            "start_offset": 0,
            "end_offset": 1,
            "type": "<IDEOGRAPHIC>",
            "position": 0
        },
        {
            "token": "活",
            "start_offset": 1,
            "end_offset": 2,
            "type": "<IDEOGRAPHIC>",
            "position": 1
        },
        {
            "token": "如",
            "start_offset": 2,
            "end_offset": 3,
            "type": "<IDEOGRAPHIC>",
            "position": 2
        },
        {
            "token": "此",
            "start_offset": 3,
            "end_offset": 4,
            "type": "<IDEOGRAPHIC>",
            "position": 3
        },
        {
            "token": "美",
            "start_offset": 4,
            "end_offset": 5,
            "type": "<IDEOGRAPHIC>",
            "position": 4
        },
        {
            "token": "丽",
            "start_offset": 5,
            "end_offset": 6,
            "type": "<IDEOGRAPHIC>",
            "position": 5
        }
    ]
}
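
To make the start_offset and end_offset fields concrete: they are character positions in the original text, so slicing the text with them gives back each token. Here is a small Python check, with the response above cut down to the relevant fields. The function name offsets_match is hypothetical, and the check assumes plain Chinese text with no characters outside the Basic Multilingual Plane.

```python
def offsets_match(text, tokens):
    """Check that text[start_offset:end_offset] equals each token.

    Assumes every character is in the Basic Multilingual Plane, so that
    Python's code-point indexes line up with the offsets in the response.
    """
    return all(
        text[t["start_offset"]:t["end_offset"]] == t["token"] for t in tokens
    )


text = "生活如此美丽"
# One token per character, mirroring the default analyzer's response above.
default_tokens = [
    {"token": ch, "start_offset": i, "end_offset": i + 1}
    for i, ch in enumerate(text)
]

print(offsets_match(text, default_tokens))  # True
print(len(default_tokens))  # 6
```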

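The default analyzer treats every Chinese character as a separate token, which is clearly not what we want. The fix is to specify an analyzer explicitly.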

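Specifying an analyzer

We installed the IK analyzer earlier. It offers two modes, ik_max_word and ik_smart. ik_max_word splits the text at the finest granularity, while ik_smart splits it at the coarsest granularity. The examples below make the difference concrete.

Tokenizing with ik_max_word: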

POST http://10.140.188.135:9200/_analyze
{  
    "analyzer": "ik_max_word",
    "text": "生活如此美好"  
}

Response:
{
    "tokens": [
        {
            "token": "生活",
            "start_offset": 0,
            "end_offset": 2,
            "type": "CN_WORD",
            "position": 0
        },
        {
            "token": "如此",
            "start_offset": 2,
            "end_offset": 4,
            "type": "CN_WORD",
            "position": 1
        },
        {
            "token": "美好",
            "start_offset": 4,
            "end_offset": 6,
            "type": "CN_WORD",
            "position": 2
        }
    ]
}

Tokenizing with ik_smart:

POST http://10.140.188.135:9200/_analyze
{  
    "analyzer": "ik_smart",
    "text": "生活如此美好"  
}

Response:
{
    "tokens": [
        {
            "token": "生活",
            "start_offset": 0,
            "end_offset": 2,
            "type": "CN_WORD",
            "position": 0
        },
        {
            "token": "如此",
            "start_offset": 2,
            "end_offset": 4,
            "type": "CN_WORD",
            "position": 1
        },
        {
            "token": "美好",
            "start_offset": 4,
            "end_offset": 6,
            "type": "CN_WORD",
            "position": 2
        }
    ]
}

As you can see, both modes give the same result for this sentence. A different sentence shows the difference:

POST http://10.140.188.135:9200/_analyze
{  
    "analyzer": "ik_max_word",
    "text": "中华人民共和国"  
}

Response:
{
    "tokens": [
        {
            "token": "中华人民共和国",
            "start_offset": 0,
            "end_offset": 7,
            "type": "CN_WORD",
            "position": 0
        },
        {
            "token": "中华人民",
            "start_offset": 0,
            "end_offset": 4,
            "type": "CN_WORD",
            "position": 1
        },
        {
            "token": "中华",
            "start_offset": 0,
            "end_offset": 2,
            "type": "CN_WORD",
            "position": 2
        },
        {
            "token": "华人",
            "start_offset": 1,
            "end_offset": 3,
            "type": "CN_WORD",
            "position": 3
        },
        {
            "token": "人民共和国",
            "start_offset": 2,
            "end_offset": 7,
            "type": "CN_WORD",
            "position": 4
        },
        {
            "token": "人民",
            "start_offset": 2,
            "end_offset": 4,
            "type": "CN_WORD",
            "position": 5
        },
        {
            "token": "共和国",
            "start_offset": 4,
            "end_offset": 7,
            "type": "CN_WORD",
            "position": 6
        },
        {
            "token": "共和",
            "start_offset": 4,
            "end_offset": 6,
            "type": "CN_WORD",
            "position": 7
        },
        {
            "token": "国",
            "start_offset": 6,
            "end_offset": 7,
            "type": "CN_CHAR",
            "position": 8
        }
    ]
}


POST http://10.140.188.135:9200/_analyze
{  
    "analyzer": "ik_smart",
    "text": "中华人民共和国"  
}

Response:
{
    "tokens": [
        {
            "token": "中华人民共和国",
            "start_offset": 0,
            "end_offset": 7,
            "type": "CN_WORD",
            "position": 0
        }
    ]
}

As you can see, ik_max_word tokenizes much more finely and emits overlapping tokens, while ik_smart is coarser.
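
One way to see the difference in a script is to check whether the returned tokens overlap. The sketch below copies the offsets from the two responses above into tuples. The function name has_overlap is my own, and this is only a way of inspecting the output, not something IK itself provides.

```python
def has_overlap(spans):
    """Return True if any two (start, end) spans share a character position."""
    ordered = sorted(spans)
    # After sorting by start, an overlap anywhere implies an overlap
    # between some pair of neighbours.
    return any(
        next_start < prev_end
        for (_, prev_end), (next_start, _) in zip(ordered, ordered[1:])
    )


# (start_offset, end_offset) pairs from the responses for 中华人民共和国.
ik_max_word_spans = [
    (0, 7), (0, 4), (0, 2), (1, 3), (2, 7), (2, 4), (4, 7), (4, 6), (6, 7),
]
ik_smart_spans = [(0, 7)]

print(has_overlap(ik_max_word_spans))  # True
print(has_overlap(ik_smart_spans))  # False
```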

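Custom dictionary

Domain-specific terms have to be configured by hand, and the IK analyzer supports custom dictionaries for this. The procedure is as follows.

In the config folder under the IK analyzer's installation directory there are several files with the .dic extension and a file named IKAnalyzer.cfg.xml. IKAnalyzer.cfg.xml is the configuration file, and the .dic files are the dictionaries.

Create a new file named mytest.dic and enter the terms to be recognized, one term per line: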

生活如
此美好
活如
此美
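
A dictionary file like this can also be generated from a script. The sketch below writes one term per line in UTF-8, which as far as I know is the encoding IK expects for its dictionaries; treat that as an assumption and check the plugin's README for your version. The function name write_dictionary and the output location are for illustration only.

```python
from pathlib import Path


def write_dictionary(path, terms):
    """Write one term per line, UTF-8 encoded, skipping blanks and duplicates."""
    kept = []
    for term in terms:
        term = term.strip()
        if term and term not in kept:
            kept.append(term)
    Path(path).write_text("\n".join(kept) + "\n", encoding="utf-8")
    return kept


if __name__ == "__main__":
    # Writes mytest.dic into the current directory; copy it into the
    # IK config folder afterwards.
    write_dictionary("mytest.dic", ["生活如", "此美好", "活如", "此美"])
```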

Edit IKAnalyzer.cfg.xml:

    <!-- Users can configure their own extension dictionaries here -->
    <entry key="ext_dict">mytest.dic</entry>
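
Before restarting, it can be handy to confirm which dictionaries the config file actually lists. The following Python sketch parses a cut-down sample of the file. The sample XML is my own abbreviation, not the full file shipped with the plugin, and splitting the value on semicolons assumes that several dictionaries may be listed that way; the function name ext_dict_entries is hypothetical.

```python
import xml.etree.ElementTree as ET

# Abbreviated stand-in for IKAnalyzer.cfg.xml (the real file has more entries).
SAMPLE_CONFIG = """<properties>
    <comment>IK Analyzer extension configuration</comment>
    <entry key="ext_dict">mytest.dic</entry>
    <entry key="ext_stopwords"></entry>
</properties>
"""


def ext_dict_entries(xml_text):
    """Return the dictionary file names listed under the ext_dict entry."""
    root = ET.fromstring(xml_text)
    for entry in root.findall("entry"):
        if entry.get("key") == "ext_dict":
            value = entry.text or ""
            # Assumption: multiple dictionaries are separated by semicolons.
            return [part.strip() for part in value.split(";") if part.strip()]
    return []


print(ext_dict_entries(SAMPLE_CONFIG))  # ['mytest.dic']
```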

After restarting the ES service, the same queries return the following:

POST http://10.140.188.135:9200/_analyze
{  
    "analyzer": "ik_max_word",
    "text": "生活如此美好"  
}

Response:
{
    "tokens": [
        {
            "token": "生活如",
            "start_offset": 0,
            "end_offset": 3,
            "type": "CN_WORD",
            "position": 0
        },
        {
            "token": "生活",
            "start_offset": 0,
            "end_offset": 2,
            "type": "CN_WORD",
            "position": 1
        },
        {
            "token": "活如",
            "start_offset": 1,
            "end_offset": 3,
            "type": "CN_WORD",
            "position": 2
        },
        {
            "token": "如此",
            "start_offset": 2,
            "end_offset": 4,
            "type": "CN_WORD",
            "position": 3
        },
        {
            "token": "此美好",
            "start_offset": 3,
            "end_offset": 6,
            "type": "CN_WORD",
            "position": 4
        },
        {
            "token": "此美",
            "start_offset": 3,
            "end_offset": 5,
            "type": "CN_WORD",
            "position": 5
        },
        {
            "token": "美好",
            "start_offset": 4,
            "end_offset": 6,
            "type": "CN_WORD",
            "position": 6
        }
    ]
}

POST http://10.140.188.135:9200/_analyze
{  
    "analyzer": "ik_smart",
    "text": "生活如此美好"  
}

Response:
{
    "tokens": [
        {
            "token": "生活如",
            "start_offset": 0,
            "end_offset": 3,
            "type": "CN_WORD",
            "position": 0
        },
        {
            "token": "此美好",
            "start_offset": 3,
            "end_offset": 6,
            "type": "CN_WORD",
            "position": 1
        }
    ]
}

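As you can see, the text is now tokenized the way we specified. Which mode to use depends on your own requirements.

(The original files in the config directory are the default dictionaries. If you delete the entry 中华人民共和国 from them, 中华人民共和国 no longer comes back as a single token as it did above.)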
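
To get an intuition for why adding dictionary entries changes the result, here is a toy dictionary-based segmenter in Python. It is not IK's actual algorithm: all_matches simply lists every dictionary word found in the text, and forward_max_match is the classic greedy forward maximum matching technique. Both function names and the tiny dictionaries are made up for illustration. On this one sentence they happen to reproduce the ik_max_word and ik_smart results shown above, but that should not be expected in general.

```python
def all_matches(text, dictionary):
    """List every dictionary word in text, ordered by start, longest first."""
    found = []
    for start in range(len(text)):
        for end in range(len(text), start, -1):
            if text[start:end] in dictionary:
                found.append(text[start:end])
    return found


def forward_max_match(text, dictionary):
    """Greedy segmentation: take the longest dictionary word at each position."""
    result = []
    pos = 0
    while pos < len(text):
        for end in range(len(text), pos, -1):
            # Fall back to a single character when nothing longer matches.
            if text[pos:end] in dictionary or end == pos + 1:
                result.append(text[pos:end])
                pos = end
                break
    return result


text = "生活如此美好"
base = {"生活", "如此", "美好"}  # stand-in for the default dictionary
custom = base | {"生活如", "此美好", "活如", "此美"}  # plus mytest.dic

print(forward_max_match(text, base))    # ['生活', '如此', '美好']
print(forward_max_match(text, custom))  # ['生活如', '此美好']
print(all_matches(text, custom))
# ['生活如', '生活', '活如', '如此', '此美好', '此美', '美好']
```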
