Using the IK Chinese Analysis Plugin for Elasticsearch

Installing the plugin

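Following the steps in the official repository, I installed the plugin from the command line. The plugin version in the download URL has to match the Elasticsearch version you are running, so replace 6.3.0 in the command below accordingly: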
https://github.com/medcl/elasticsearch-analysis-ik/tree/v7.3.1

./bin/elasticsearch-plugin install https://github.com/medcl/elasticsearch-analysis-ik/releases/download/v6.3.0/elasticsearch-analysis-ik-6.3.0.zip
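The plugin has to be built for the same version as the Elasticsearch node it is installed into, so the download URL changes with the version. The sketch below only illustrates that pattern: `ik_release_url` is a hypothetical helper, not part of the plugin or of Elasticsearch, and it assumes the URL layout of the command above also holds for the release you need.

```python
# Hypothetical helper: builds the IK release URL for a given Elasticsearch
# version, following the pattern of the install command above.
RELEASE_PATTERN = (
    "https://github.com/medcl/elasticsearch-analysis-ik/releases/download/"
    "v{version}/elasticsearch-analysis-ik-{version}.zip"
)


def ik_release_url(es_version: str) -> str:
    """Return the download URL of the IK plugin built for es_version."""
    parts = es_version.split(".")
    # Expect a plain three-part version such as "7.3.1".
    if len(parts) != 3 or not all(p.isdigit() for p in parts):
        raise ValueError(f"unexpected version string: {es_version!r}")
    return RELEASE_PATTERN.format(version=es_version)


if __name__ == "__main__":
    print(ik_release_url("6.3.0"))
```

The printed URL can then be passed to ./bin/elasticsearch-plugin install in place of the hard-coded one.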

Creating an index and its mapping

Following the official guide:

curl -XPUT http://localhost:9200/index
curl -XPOST http://localhost:9200/index/_mapping -H 'Content-Type:application/json' -d'
{
        "properties": {
            "content": {
                "type": "text",
                "analyzer": "ik_max_word",
                "search_analyzer": "ik_smart"
            }
        }
}'

Inserting documents

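When I tried to insert a document, I could not get curl to index any data no matter what I tried. I was using the curl that ships with Cygwin, and it returned this error: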

$ curl -XPOST -H 'Content-Type:application/json'  http://localhost:9200/index/_create/4 -d '{"content":"中文行不行"}'
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   567  100   543  100    24    543     24  0:00:01 --:--:--  0:00:01  7269
{"error":{"root_cause":[{"type":"mapper_parsing_exception","reason":"failed to parse field [content] of type [text] in document with id '4'. Preview of field's value: ''"}],"type":"mapper_parsing_exception","reason":"failed to parse field [content] of type [text] in document with id '4'. Preview of field's value: ''","caused_by":{"type":"json_parse_exception","reason":"Invalid UTF-8 middle byte 0xd0\n at [Source: org.elasticsearch.common.bytes.BytesReference$MarkSupportingStreamInputWrapper@56ebd434; line: 1, column: 15]"}},"status":400}

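In other words, the Chinese text did not reach Elasticsearch as valid UTF-8 when sent from Cygwin's curl. The request body was most likely encoded in the terminal's local code page (for example GBK), which the JSON parser rejects.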
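The byte in the error message hints at what went wrong. The sketch below encodes the same JSON body twice and compares the results. That the Cygwin terminal used GBK (code page 936) is an assumption, but it fits both the 0xd0 byte in the error and the 24 uploaded bytes in curl's progress output.

```python
# The request body from the failing curl command.
body = '{"content":"中文行不行"}'

utf8_bytes = body.encode("utf-8")  # what Elasticsearch expects
gbk_bytes = body.encode("gbk")     # assumed terminal encoding (code page 936)

# The first Chinese character starts right after the ASCII prefix {"content":"
prefix_len = len('{"content":"')

print(len(utf8_bytes), len(gbk_bytes))
print(gbk_bytes[prefix_len:prefix_len + 2].hex())   # bytes of "中" in GBK
print(utf8_bytes[prefix_len:prefix_len + 3].hex())  # bytes of "中" in UTF-8

# Decoding the GBK bytes as UTF-8 fails, which mirrors the parser's complaint.
try:
    gbk_bytes.decode("utf-8")
    decodes_as_utf8 = True
except UnicodeDecodeError:
    decodes_as_utf8 = False
print(decodes_as_utf8)
```

In GBK the body is 24 bytes long and "中" becomes d6 d0. A UTF-8 parser reads 0xd6 as the start of a two-byte sequence and then finds 0xd0, which is not a valid continuation byte.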

The workaround I ended up with was to send the request from Dev Tools in Kibana:

POST index/_create/6
{
  "content": "中国是我的国家,长春是中国的城市"
}

This succeeded.
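Another way to sidestep the terminal's encoding is to build the request in a script and encode the body explicitly. The following is a minimal sketch using only Python's standard library, not a tested recipe. `build_create_request` is a hypothetical helper, and the host, index name, and document ID are simply the ones used in this post.

```python
import json
import urllib.request


def build_create_request(base_url: str, index: str, doc_id: str,
                         doc: dict) -> urllib.request.Request:
    """Build a POST <index>/_create/<doc_id> request with a UTF-8 JSON body."""
    # ensure_ascii=False keeps the Chinese characters literal; the explicit
    # encode() then guarantees that the bytes on the wire are UTF-8.
    payload = json.dumps(doc, ensure_ascii=False).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/{index}/_create/{doc_id}",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


if __name__ == "__main__":
    req = build_create_request(
        "http://localhost:9200", "index", "6",
        {"content": "中国是我的国家,长春是中国的城市"},
    )
    # Sending needs a running Elasticsearch node, so it is left commented out:
    # with urllib.request.urlopen(req) as resp:
    #     print(resp.read().decode("utf-8"))
    print(req.get_method(), req.full_url, len(req.data))
```

Because the body is encoded inside the script, the result no longer depends on the code page of the shell that launches it.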

Trying a search with highlighting:

POST index/_search
{
    "query" : { "match" : { "content" : "中国" }},
    "highlight" : {
        "pre_tags" : ["<tag1>", "<tag2>"],
        "post_tags" : ["</tag1>", "</tag2>"],
        "fields" : {
            "content" : {}
        }
    }
}

Result:

{
  "took" : 120,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 2,
      "relation" : "eq"
    },
    "max_score" : 2.080443,
    "hits" : [
      {
        "_index" : "index",
        "_type" : "_doc",
        "_id" : "5",
        "_score" : 2.080443,
        "_source" : {
          "content" : "中国"
        },
        "highlight" : {
          "content" : [
            "<tag1>中国</tag1>"
          ]
        }
      },
      {
        "_index" : "index",
        "_type" : "_doc",
        "_id" : "6",
        "_score" : 1.980741,
        "_source" : {
          "content" : "中国是我的国家,上海是中国的城市"
        },
        "highlight" : {
          "content" : [
            "<tag1>中国</tag1>是我的国家,上海是<tag1>中国</tag1>的城市"
          ]
        }
      }
    ]
  }
}
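To use the highlighted fragments in an application, they have to be pulled out of the response. This is a minimal sketch, assuming the response has already been parsed into a Python dict shaped like the result above. `highlight_fragments` is a hypothetical helper name.

```python
def highlight_fragments(response: dict, field: str) -> dict:
    """Map each hit's _id to its list of highlighted fragments for field."""
    fragments = {}
    for hit in response["hits"]["hits"]:
        # Fall back to an empty list when a hit has no highlight for the field.
        fragments[hit["_id"]] = hit.get("highlight", {}).get(field, [])
    return fragments


if __name__ == "__main__":
    # Trimmed copy of the search result shown above.
    response = {
        "hits": {
            "hits": [
                {"_id": "5", "highlight": {"content": ["<tag1>中国</tag1>"]}},
                {"_id": "6", "highlight": {"content": [
                    "<tag1>中国</tag1>是我的国家,上海是<tag1>中国</tag1>的城市"
                ]}},
            ]
        }
    }
    for doc_id, frags in highlight_fragments(response, "content").items():
        print(doc_id, frags)
```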

Trying the analyzer:

GET /index/_analyze
{
  "text": " 对于你,我始终只能以陌生人的身份去怀念。",
  "analyzer": "ik_smart"
}

Result:

{
  "tokens" : [
    {
      "token" : "对于",
      "start_offset" : 1,
      "end_offset" : 3,
      "type" : "CN_WORD",
      "position" : 0
    },
    {
      "token" : "你",
      "start_offset" : 3,
      "end_offset" : 4,
      "type" : "CN_CHAR",
      "position" : 1
    },
    {
      "token" : "我",
      "start_offset" : 5,
      "end_offset" : 6,
      "type" : "CN_CHAR",
      "position" : 2
    },
    {
      "token" : "始终",
      "start_offset" : 6,
      "end_offset" : 8,
      "type" : "CN_WORD",
      "position" : 3
    },
    {
      "token" : "只",
      "start_offset" : 8,
      "end_offset" : 9,
      "type" : "CN_CHAR",
      "position" : 4
    },
    {
      "token" : "能以",
      "start_offset" : 9,
      "end_offset" : 11,
      "type" : "CN_WORD",
      "position" : 5
    },
    {
      "token" : "陌生人",
      "start_offset" : 11,
      "end_offset" : 14,
      "type" : "CN_WORD",
      "position" : 6
    },
    {
      "token" : "的",
      "start_offset" : 14,
      "end_offset" : 15,
      "type" : "CN_CHAR",
      "position" : 7
    },
    {
      "token" : "身份",
      "start_offset" : 15,
      "end_offset" : 17,
      "type" : "CN_WORD",
      "position" : 8
    },
    {
      "token" : "去",
      "start_offset" : 17,
      "end_offset" : 18,
      "type" : "CN_CHAR",
      "position" : 9
    },
    {
      "token" : "怀念",
      "start_offset" : 18,
      "end_offset" : 20,
      "type" : "CN_WORD",
      "position" : 10
    }
  ]
}
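The start_offset and end_offset values are character positions in the analyzed text, which is why the first token starts at 1: the sample text begins with a space. The sketch below checks this against a few tokens copied from the result. It assumes the offsets can be used directly as Python string indices, which holds here because every character is in the Basic Multilingual Plane. `mismatched_tokens` is a hypothetical helper.

```python
text = " 对于你,我始终只能以陌生人的身份去怀念。"

# (token, start_offset, end_offset) copied from the _analyze result above.
tokens = [
    ("对于", 1, 3),
    ("你", 3, 4),
    ("我", 5, 6),
    ("始终", 6, 8),
    ("陌生人", 11, 14),
    ("怀念", 18, 20),
]


def mismatched_tokens(text, tokens):
    """Return the tokens whose offsets do not slice back to the token text."""
    return [t for t in tokens if text[t[1]:t[2]] != t[0]]


# An empty list means every offset pair slices back to its token.
print(mismatched_tokens(text, tokens))
```

The punctuation marks at positions 4 and 20 produce no token, which explains the gap between the offsets of "你" and "我".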
