Using the ik Chinese analysis plugin for Elasticsearch

能想多少想多少

Category: BI visualization

Article link: https://blog.csdn.net/weixin_42644062/article/details/102586218

Installing the plugin

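I installed the plugin from the command line, following the steps in the official README:
https://github.com/medcl/elasticsearch-analysis-ik/tree/v7.3.1

The command below is the README's example for 6.3.0. The plugin version has to match the Elasticsearch version, so substitute your own version in both places in the URL: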

./bin/elasticsearch-plugin install https://github.com/medcl/elasticsearch-analysis-ik/releases/download/v6.3.0/elasticsearch-analysis-ik-6.3.0.zip
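
Because the version appears twice in the download URL, it is easy to update one occurrence and miss the other. The helper below is a minimal sketch that builds the URL from a single version string. The function name `ik_release_url` is made up for illustration, and the sketch assumes that other releases follow the same naming pattern as the 6.3.0 URL above; it does not check that a release for the given version actually exists.

```python
def ik_release_url(es_version: str) -> str:
    """Build the ik download URL for a given Elasticsearch version.

    Assumes the release naming pattern seen in the 6.3.0 example URL.
    """
    base = "https://github.com/medcl/elasticsearch-analysis-ik/releases/download"
    return f"{base}/v{es_version}/elasticsearch-analysis-ik-{es_version}.zip"


if __name__ == "__main__":
    # Pass the result to ./bin/elasticsearch-plugin install <url>
    print(ik_release_url("6.3.0"))
```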

Creating an index and its mapping

Following the official guide:

curl -XPUT http://localhost:9200/index

curl -XPOST http://localhost:9200/index/_mapping -H 'Content-Type: application/json' -d'
{
    "properties": {
        "content": {
            "type": "text",
            "analyzer": "ik_max_word",
            "search_analyzer": "ik_smart"
        }
    }
}'
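
The two curl calls above can also be combined into one request, since the create-index API accepts a `mappings` section in its body. The sketch below builds that request with Python's standard library. It assumes Elasticsearch 7.x (typeless mappings) running at the same `localhost:9200` address; the function name `build_create_index_request` is hypothetical, and the request is only constructed here, not sent.

```python
import json
import urllib.request


def build_create_index_request(base_url: str, index: str) -> urllib.request.Request:
    """Build a PUT request that creates the index together with its ik mapping."""
    body = {
        "mappings": {
            "properties": {
                "content": {
                    "type": "text",
                    # Fine-grained segmentation at index time,
                    # coarser segmentation at search time.
                    "analyzer": "ik_max_word",
                    "search_analyzer": "ik_smart",
                }
            }
        }
    }
    return urllib.request.Request(
        f"{base_url}/{index}",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


if __name__ == "__main__":
    req = build_create_index_request("http://localhost:9200", "index")
    print(req.get_method(), req.full_url)
    # To send it against a running node: urllib.request.urlopen(req)
```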

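Inserting a document

When I tried to insert a document, curl simply would not get the data in, no matter what I tried. I was using the curl that ships with Cygwin, and this was the error: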

$ curl -XPOST -H 'Content-Type:application/json'  http://localhost:9200/index/_create/4 -d '{"content":"中文行不行"}'
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   567  100   543  100    24    543     24  0:00:01 --:--:--  0:00:01  7269
{"error":{"root_cause":[{"type":"mapper_parsing_exception","reason":"failed to parse field [content] of type [text] in document with id '4'. Preview of field's value: ''"}],"type":"mapper_parsing_exception","reason":"failed to parse field [content] of type [text] in document with id '4'. Preview of field's value: ''","caused_by":{"type":"json_parse_exception","reason":"Invalid UTF-8 middle byte 0xd0\n at [Source: org.elasticsearch.common.bytes.BytesReference$MarkSupportingStreamInputWrapper@56ebd434; line: 1, column: 15]"}},"status":400}

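In other words, the Chinese text did not reach Elasticsearch as valid UTF-8. The error complains about the byte 0xd0, and 0xD6 0xD0 is how GBK encodes 中, so the likely cause is that the Cygwin terminal handed the text to curl in a GBK-family code page, rather than a limitation of curl itself. The upload size of 24 bytes in the transfer stats also fits GBK (2 bytes per character) better than UTF-8, which would need 29.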
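
The byte arithmetic behind that error can be checked offline. The snippet below encodes the same request body both ways and compares the sizes with the 24 bytes curl reported uploading. That the terminal was using GBK is an assumption; the post does not say which code page was active.

```python
# The exact body passed to curl -d in the failing command.
body = '{"content":"中文行不行"}'

utf8_bytes = body.encode("utf-8")
gbk_bytes = body.encode("gbk")

print(len(utf8_bytes))         # 29: what a UTF-8 terminal would have sent
print(len(gbk_bytes))          # 24: matches the upload size curl reported
print(gbk_bytes[12:14].hex())  # d6d0: the GBK bytes for the first character

# A JSON parser that expects UTF-8 cannot decode the GBK bytes.
try:
    gbk_bytes.decode("utf-8")
except UnicodeDecodeError as exc:
    print("not valid UTF-8:", exc.reason)
```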

The fix I ended up with was to run the request in Kibana's Dev Tools instead:

POST index/_create/6
{
  "content": "中国是我的国家，上海是中国的城市"
}

This succeeded.
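
Another way around the encoding problem, for cases where Kibana is not available, is to send a body that contains no non-ASCII bytes at all. JSON allows any character to be written as a `\uXXXX` escape, and Python's `json.dumps` does this by default. The sketch below builds such a request; the function name `build_create_doc_request` is hypothetical, the address is assumed to be the same `localhost:9200`, and the request is only constructed, not sent.

```python
import json
import urllib.request


def build_create_doc_request(base_url: str, index: str, doc_id: str, doc: dict) -> urllib.request.Request:
    """Build a POST <index>/_create/<id> request whose body is pure ASCII."""
    # json.dumps escapes non-ASCII characters as \uXXXX by default,
    # so the terminal or shell code page no longer matters.
    data = json.dumps(doc).encode("ascii")
    return urllib.request.Request(
        f"{base_url}/{index}/_create/{doc_id}",
        data=data,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


if __name__ == "__main__":
    req = build_create_doc_request(
        "http://localhost:9200", "index", "6",
        {"content": "中国是我的国家，上海是中国的城市"},
    )
    print(req.data.decode("ascii"))
    # To send it against a running node: urllib.request.urlopen(req)
```

The escaped body printed by this script could equally be pasted into the `-d` argument of curl, since it contains only ASCII characters.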

Next, a search with highlighting:

POST index/_search
{
    "query" : { "match" : { "content" : "中国" }},
    "highlight" : {
        "pre_tags" : ["<tag1>", "<tag2>"],
        "post_tags" : ["</tag1>", "</tag2>"],
        "fields" : {
            "content" : {}
        }
    }
}

Result (the index also contained a document with id 5 whose content is just 中国):

{
  "took" : 120,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 2,
      "relation" : "eq"
    },
    "max_score" : 2.080443,
    "hits" : [
      {
        "_index" : "index",
        "_type" : "_doc",
        "_id" : "5",
        "_score" : 2.080443,
        "_source" : {
          "content" : "中国"
        },
        "highlight" : {
          "content" : [
            "<tag1>中国</tag1>"
          ]
        }
      },
      {
        "_index" : "index",
        "_type" : "_doc",
        "_id" : "6",
        "_score" : 1.980741,
        "_source" : {
          "content" : "中国是我的国家，上海是中国的城市"
        },
        "highlight" : {
          "content" : [
            "<tag1>中国</tag1>是我的国家，上海是<tag1>中国</tag1>的城市"
          ]
        }
      }
    ]
  }
}
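
In application code, the highlighted fragments usually need to be pulled out of this response. The sketch below shows one way to do that on a parsed response dictionary. The function name `highlighted_fragments` is made up for illustration, and the sample data is a trimmed copy of the response above.

```python
def highlighted_fragments(response: dict, field: str) -> dict:
    """Map each hit's _id to its list of highlighted fragments for one field."""
    result = {}
    for hit in response["hits"]["hits"]:
        # A hit has no "highlight" key when nothing in it was highlighted.
        result[hit["_id"]] = hit.get("highlight", {}).get(field, [])
    return result


if __name__ == "__main__":
    sample = {
        "hits": {
            "hits": [
                {"_id": "5", "highlight": {"content": ["<tag1>中国</tag1>"]}},
                {"_id": "6", "highlight": {"content": [
                    "<tag1>中国</tag1>是我的国家，上海是<tag1>中国</tag1>的城市"
                ]}},
            ]
        }
    }
    for doc_id, fragments in highlighted_fragments(sample, "content").items():
        print(doc_id, fragments)
```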

Then a look at the tokenization itself:

GET /index/_analyze
{
  "text": " 对于你，我始终只能以陌生人的身份去怀念。",
  "analyzer": "ik_smart"
}

Result:

{
  "tokens" : [
    {
      "token" : "对于",
      "start_offset" : 1,
      "end_offset" : 3,
      "type" : "CN_WORD",
      "position" : 0
    },
    {
      "token" : "你",
      "start_offset" : 3,
      "end_offset" : 4,
      "type" : "CN_CHAR",
      "position" : 1
    },
    {
      "token" : "我",
      "start_offset" : 5,
      "end_offset" : 6,
      "type" : "CN_CHAR",
      "position" : 2
    },
    {
      "token" : "始终",
      "start_offset" : 6,
      "end_offset" : 8,
      "type" : "CN_WORD",
      "position" : 3
    },
    {
      "token" : "只",
      "start_offset" : 8,
      "end_offset" : 9,
      "type" : "CN_CHAR",
      "position" : 4
    },
    {
      "token" : "能以",
      "start_offset" : 9,
      "end_offset" : 11,
      "type" : "CN_WORD",
      "position" : 5
    },
    {
      "token" : "陌生人",
      "start_offset" : 11,
      "end_offset" : 14,
      "type" : "CN_WORD",
      "position" : 6
    },
    {
      "token" : "的",
      "start_offset" : 14,
      "end_offset" : 15,
      "type" : "CN_CHAR",
      "position" : 7
    },
    {
      "token" : "身份",
      "start_offset" : 15,
      "end_offset" : 17,
      "type" : "CN_WORD",
      "position" : 8
    },
    {
      "token" : "去",
      "start_offset" : 17,
      "end_offset" : 18,
      "type" : "CN_CHAR",
      "position" : 9
    },
    {
      "token" : "怀念",
      "start_offset" : 18,
      "end_offset" : 20,
      "type" : "CN_WORD",
      "position" : 10
    }
  ]
}
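
The `start_offset` and `end_offset` values index into the original text, including the leading space, which is why the first token starts at 1 rather than 0. The sketch below checks that relationship for a few tokens copied from the output above. The function name `offsets_match` is hypothetical, and slicing a Python string this way only lines up with the reported offsets when every character is in the Basic Multilingual Plane, as is the case here.

```python
def offsets_match(text: str, tokens: list) -> bool:
    """Check that each token equals the slice of text its offsets point at."""
    return all(
        text[t["start_offset"]:t["end_offset"]] == t["token"]
        for t in tokens
    )


if __name__ == "__main__":
    text = " 对于你，我始终只能以陌生人的身份去怀念。"
    # A subset of the tokens returned by the _analyze call above.
    tokens = [
        {"token": "对于", "start_offset": 1, "end_offset": 3},
        {"token": "陌生人", "start_offset": 11, "end_offset": 14},
        {"token": "怀念", "start_offset": 18, "end_offset": 20},
    ]
    print(offsets_match(text, tokens))
```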