Elasticsearch Study Notes 5: Chinese Analyzers

IK

https://github.com/medcl/elasticsearch-analysis-ik

  • Analyzer: ik_smart, ik_max_word
  • Tokenizer: ik_smart, ik_max_word
  1. Download

Download page: https://github.com/medcl/elasticsearch-analysis-ik/releases

Downloaded locally: elasticsearch-analysis-ik-6.3.2.zip

  2. Unzip

Rename the directory to analysis-ik.

// 3. Move the contents of the analysis-ik/config folder to the {ES_HOME}/config directory

  4. Move the unzipped directory to the {ES_HOME}/plugins directory.

  5. Restart the ES service.

The startup log will include an extra plugin-loading line: [YABPFPe] loaded plugin [analysis-ik]
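Instead of unzipping by hand, the plugin can also be installed in one step with the elasticsearch-plugin tool (documented in the IK README; match the version to your ES build), followed by the same restart:

./bin/elasticsearch-plugin install https://github.com/medcl/elasticsearch-analysis-ik/releases/download/v6.3.2/elasticsearch-analysis-ik-6.3.2.zip

The other analysis plugins below publish release zips that can be installed the same way.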


Test. ik_smart produces the coarsest-grained segmentation, while ik_max_word exhaustively splits out every word it can find:

GET /_analyze
{
  "text": "中华人民共和国国歌",
  "analyzer": "ik_smart"
}

{
  "tokens": [
    {
      "token": "中华人民共和国",
      "start_offset": 0,
      "end_offset": 7,
      "type": "CN_WORD",
      "position": 0
    },
    {
      "token": "国歌",
      "start_offset": 7,
      "end_offset": 9,
      "type": "CN_WORD",
      "position": 1
    }
  ]
}

GET /_analyze
{
  "text": "中华人民共和国国歌",
  "analyzer": "ik_max_word"
}

{
  "tokens": [
    {
      "token": "中华人民共和国",
      "start_offset": 0,
      "end_offset": 7,
      "type": "CN_WORD",
      "position": 0
    },
    {
      "token": "中华人民",
      "start_offset": 0,
      "end_offset": 4,
      "type": "CN_WORD",
      "position": 1
    },
    {
      "token": "中华",
      "start_offset": 0,
      "end_offset": 2,
      "type": "CN_WORD",
      "position": 2
    },
    {
      "token": "华人",
      "start_offset": 1,
      "end_offset": 3,
      "type": "CN_WORD",
      "position": 3
    },
    {
      "token": "人民共和国",
      "start_offset": 2,
      "end_offset": 7,
      "type": "CN_WORD",
      "position": 4
    },
    {
      "token": "人民",
      "start_offset": 2,
      "end_offset": 4,
      "type": "CN_WORD",
      "position": 5
    },
    {
      "token": "共和国",
      "start_offset": 4,
      "end_offset": 7,
      "type": "CN_WORD",
      "position": 6
    },
    {
      "token": "共和",
      "start_offset": 4,
      "end_offset": 6,
      "type": "CN_WORD",
      "position": 7
    },
    {
      "token": "国",
      "start_offset": 6,
      "end_offset": 7,
      "type": "CN_CHAR",
      "position": 8
    },
    {
      "token": "国歌",
      "start_offset": 7,
      "end_offset": 9,
      "type": "CN_WORD",
      "position": 9
    }
  ]
}
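With IK installed, a common pattern from the IK README is to index a field with ik_max_word (maximum recall) and query it with ik_smart (coarser splits). A minimal sketch, assuming ES 6.x typed mappings and placeholder names index_ik, doc, and content:

PUT /index_ik
{
  "mappings": {
    "doc": {
      "properties": {
        "content": {
          "type": "text",
          "analyzer": "ik_max_word",
          "search_analyzer": "ik_smart"
        }
      }
    }
  }
}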

analysis-hanlp

https://github.com/KennFalcon/elasticsearch-analysis-hanlp

Segmentation modes provided:

  • hanlp: default segmentation
  • hanlp_standard: standard segmentation
  • hanlp_index: index-oriented segmentation
  • hanlp_nlp: NLP segmentation
  • hanlp_crf: CRF segmentation
  • hanlp_n_short: N-shortest-path segmentation
  • hanlp_dijkstra: shortest-path segmentation
  • hanlp_speed: high-speed dictionary segmentation
  1. Download

https://github.com/KennFalcon/elasticsearch-analysis-hanlp/releases/tag/v6.3.2

Downloaded locally: elasticsearch-analysis-hanlp-6.3.2.zip

  2. Unzip

Rename the directory to analysis-hanlp.

  3. Copy the contents of the analysis-hanlp/data folder to the {ES_HOME}/data directory.

  4. Copy the contents of the analysis-hanlp/config folder to the {ES_HOME}/config/analysis-hanlp/ directory.

  5. Move the unzipped directory to the {ES_HOME}/plugins directory.

  6. Restart the ES service.

The startup log will include an extra plugin-loading line: [YABPFPe] loaded plugin [analysis-hanlp]


Test:

GET /_analyze
{
  "text": "中华人民共和国国歌",
  "analyzer": "hanlp"
}

{
  "tokens": [
    {
      "token": "中华人民共和国",
      "start_offset": 0,
      "end_offset": 7,
      "type": "ns",
      "position": 0
    },
    {
      "token": "国歌",
      "start_offset": 0,
      "end_offset": 2,
      "type": "n",
      "position": 1
    }
  ]
}

GET /_analyze
{
  "text": "中华人民共和国国歌",
  "analyzer": "hanlp_index"
}
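The HanLP analyzers plug into a mapping the same way as IK. A minimal sketch, assuming placeholder names index_hanlp, doc, and content, with the index-oriented mode at index time and the default mode at query time:

PUT /index_hanlp
{
  "mappings": {
    "doc": {
      "properties": {
        "content": {
          "type": "text",
          "analyzer": "hanlp_index",
          "search_analyzer": "hanlp"
        }
      }
    }
  }
}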

elasticsearch-analysis-pinyin

https://github.com/medcl/elasticsearch-analysis-pinyin

  • pinyin
  1. Download

Downloaded locally: elasticsearch-analysis-pinyin-6.3.2.zip

  2. Unzip

Rename the directory to analysis-pinyin.

  3. Move the unzipped directory to the {ES_HOME}/plugins directory.

  4. Restart the ES service.

The startup log will include an extra plugin-loading line: [YABPFPe] loaded plugin [analysis-pinyin]

Test:

GET /_analyze
{
  "text": "中华民族",
  "analyzer": "pinyin"
}

{
  "tokens": [
    {
      "token": "zhong",
      "start_offset": 0,
      "end_offset": 0,
      "type": "word",
      "position": 0
    },
    {
      "token": "hua",
      "start_offset": 0,
      "end_offset": 0,
      "type": "word",
      "position": 1
    },
    {
      "token": "min",
      "start_offset": 0,
      "end_offset": 0,
      "type": "word",
      "position": 2
    },
    {
      "token": "zu",
      "start_offset": 0,
      "end_offset": 0,
      "type": "word",
      "position": 3
    },
    {
      "token": "zhmz",
      "start_offset": 0,
      "end_offset": 0,
      "type": "word",
      "position": 3
    }
  ]
}

Usage in an index:

  1. Create an index with a custom pinyin analyzer
PUT /index3/ 
{
    "settings" : {
        "analysis" : {
            "analyzer" : {
                "pinyin_analyzer" : {
                    "tokenizer" : "my_pinyin"
                }
            },
            "tokenizer" : {
                "my_pinyin" : {
                    "type" : "pinyin",
                    "keep_separate_first_letter" : false,
                    "keep_full_pinyin" : true,
                    "keep_original" : true,
                    "limit_first_letter_length" : 16,
                    "lowercase" : true,
                    "remove_duplicated_term" : true
                }
            }
        }
    }
}
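Briefly, per the pinyin plugin's README: keep_full_pinyin emits one token per syllable (liu, de, hua), keep_original also emits the unconverted input, limit_first_letter_length caps the length of the first-letter abbreviation (e.g. ldh), and remove_duplicated_term drops duplicate terms; keep_separate_first_letter is disabled here, so single first letters are not emitted as separate tokens.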
  2. Test the analyzer on a Chinese name, such as 刘德华
GET /index3/_analyze
{
  "text": ["刘德华"],
  "analyzer": "pinyin_analyzer"
}
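With these settings, the response should contain tokens along the lines of liu, de, hua, the original 刘德华 (kept by keep_original), and the first-letter abbreviation ldh.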
  3. Create the mapping: name is a keyword field for exact matching, with a name.pinyin sub-field analyzed by pinyin_analyzer (the 6.x mapping API requires a type name, doc here)
POST /index3/_mapping/doc
{
    "properties": {
        "name": {
            "type": "keyword",
            "fields": {
                "pinyin": {
                    "type": "text",
                    "store": false,
                    "term_vector": "with_offsets",
                    "analyzer": "pinyin_analyzer",
                    "boost": 10
                }
            }
        }
    }
}
  4. Index a document
PUT /index3/doc/andy/_create
{"name":"刘德华"}
  5. Search
curl http://localhost:9200/index3/_search?q=name:%E5%88%98%E5%BE%B7%E5%8D%8E
curl http://localhost:9200/index3/_search?q=name.pinyin:%e5%88%98%e5%be%b7
curl http://localhost:9200/index3/_search?q=name.pinyin:liu
curl http://localhost:9200/index3/_search?q=name.pinyin:ldh
curl http://localhost:9200/index3/_search?q=name.pinyin:de+hua
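The q= parameter is URI-search shorthand; the same lookups can also be written as JSON-body requests. A minimal sketch of a match query on the pinyin sub-field:

GET /index3/_search
{
  "query": {
    "match": {
      "name.pinyin": "ldh"
    }
  }
}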