Elasticsearch Analysis

Analysis pipeline
  1. Character Filters: pre-process the raw text, e.g. stripping HTML tags
  2. Tokenizer: split the text into tokens according to rules, e.g. on whitespace
  3. Token Filters: post-process the generated tokens, e.g. lowercasing (see the sketch below for how the three stages compose)
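
As a minimal sketch of how the three stages fit together (the index name my_index and analyzer name my_custom_analyzer are made up for illustration), a custom analyzer can wire a character filter, a tokenizer, and token filters into one pipeline:

PUT /my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_custom_analyzer": {
          "type": "custom",
          "char_filter": ["html_strip"],
          "tokenizer": "standard",
          "filter": ["lowercase"]
        }
      }
    }
  }
}

# test the custom analyzer against the index
GET /my_index/_analyze
{
  "analyzer": "my_custom_analyzer",
  "text": "<p>Hello World</p>"
}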

Built-in analyzers

Use GET /_analyze to inspect how a piece of text is tokenized.

Standard

The default analyzer: splits text on word boundaries and lowercases the tokens.

GET /_analyze
{
  "analyzer": "standard",
  "text":"1 A 2 b"
}

# result
{
  "tokens" : [
    {
      "token" : "1",
      "start_offset" : 0,
      "end_offset" : 1,
      "type" : "<NUM>",
      "position" : 0
    },
    {
      "token" : "a",
      "start_offset" : 2,
      "end_offset" : 3,
      "type" : "<ALPHANUM>",
      "position" : 1
    },
    {
      "token" : "2",
      "start_offset" : 4,
      "end_offset" : 5,
      "type" : "<NUM>",
      "position" : 2
    },
    {
      "token" : "b",
      "start_offset" : 6,
      "end_offset" : 7,
      "type" : "<ALPHANUM>",
      "position" : 3
    }
  ]
}
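
The standard analyzer is also configurable. As a small sketch (the index and analyzer names here are assumptions for illustration), it accepts a stopwords parameter, for example the built-in _english_ list:

PUT /my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_standard": {
          "type": "standard",
          "stopwords": "_english_"
        }
      }
    }
  }
}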

Simple

Splits on any non-letter character (Chinese characters count as letters), drops the non-letter characters, and lowercases the tokens (Chinese characters are left unchanged).

GET /_analyze
{
  "analyzer": "simple",
  "text":"1 好 ef A-B 2 c"
}

# result
{
  "tokens" : [
    {
      "token" : "好",
      "start_offset" : 2,
      "end_offset" : 3,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "ef",
      "start_offset" : 4,
      "end_offset" : 6,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "a",
      "start_offset" : 7,
      "end_offset" : 8,
      "type" : "word",
      "position" : 2
    },
    {
      "token" : "b",
      "start_offset" : 9,
      "end_offset" : 10,
      "type" : "word",
      "position" : 3
    },
    {
      "token" : "c",
      "start_offset" : 13,
      "end_offset" : 14,
      "type" : "word",
      "position" : 4
    }
  ]
}

Stop

Like Simple, but additionally removes stop words (the, a, is, ...); the stop-word list can be customized, as sketched after the result below.

GET /_analyze
{
  "analyzer": "stop",
  "text":"1 好 ef A-B is the  2 c"
}

# result
{
  "tokens" : [
    {
      "token" : "好",
      "start_offset" : 2,
      "end_offset" : 3,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "ef",
      "start_offset" : 4,
      "end_offset" : 6,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "b",
      "start_offset" : 9,
      "end_offset" : 10,
      "type" : "word",
      "position" : 3
    },
    {
      "token" : "c",
      "start_offset" : 21,
      "end_offset" : 22,
      "type" : "word",
      "position" : 6
    }
  ]
}
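
A minimal sketch of overriding the stop-word list (the index and analyzer names are assumptions for illustration):

PUT /my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_stop_analyzer": {
          "type": "stop",
          "stopwords": ["the", "is", "a"]
        }
      }
    }
  }
}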

Whitespace

Splits on whitespace only; tokens are not lowercased.

GET /_analyze
{
  "analyzer": "whitespace",
  "text":"1 好 ef A-B is the  2 c"
}

# result
{
  "tokens" : [
    {
      "token" : "1",
      "start_offset" : 0,
      "end_offset" : 1,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "好",
      "start_offset" : 2,
      "end_offset" : 3,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "ef",
      "start_offset" : 4,
      "end_offset" : 6,
      "type" : "word",
      "position" : 2
    },
    {
      "token" : "A-B",
      "start_offset" : 7,
      "end_offset" : 10,
      "type" : "word",
      "position" : 3
    },
    {
      "token" : "is",
      "start_offset" : 11,
      "end_offset" : 13,
      "type" : "word",
      "position" : 4
    },
    {
      "token" : "the",
      "start_offset" : 14,
      "end_offset" : 17,
      "type" : "word",
      "position" : 5
    },
    {
      "token" : "2",
      "start_offset" : 19,
      "end_offset" : 20,
      "type" : "word",
      "position" : 6
    },
    {
      "token" : "c",
      "start_offset" : 21,
      "end_offset" : 22,
      "type" : "word",
      "position" : 7
    }
  ]
}

Keyword

Does not tokenize at all: the entire input is emitted as a single token.

GET /_analyze
{
  "analyzer": "keyword",
  "text":"1 好 ef A-B is the  2 c"
}

#result
{
  "tokens" : [
    {
      "token" : "1 好 ef A-B is the  2 c",
      "start_offset" : 0,
      "end_offset" : 22,
      "type" : "word",
      "position" : 0
    }
  ]
}

Pattern

Splits on a regular expression that matches the token separators; the default pattern is \W+ (any run of non-word characters). A custom pattern has to be configured in the index settings, as sketched after the result below.

GET /_analyze
{
  "analyzer": "pattern:\d+",
  "text":"1 好 ef A-B is the  2 c"
}

# result
{
  "tokens" : [
    {
      "token" : "1",
      "start_offset" : 0,
      "end_offset" : 1,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "ef",
      "start_offset" : 4,
      "end_offset" : 6,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "a",
      "start_offset" : 7,
      "end_offset" : 8,
      "type" : "word",
      "position" : 2
    },
    {
      "token" : "b",
      "start_offset" : 9,
      "end_offset" : 10,
      "type" : "word",
      "position" : 3
    },
    {
      "token" : "is",
      "start_offset" : 11,
      "end_offset" : 13,
      "type" : "word",
      "position" : 4
    },
    {
      "token" : "the",
      "start_offset" : 14,
      "end_offset" : 17,
      "type" : "word",
      "position" : 5
    },
    {
      "token" : "2",
      "start_offset" : 19,
      "end_offset" : 20,
      "type" : "word",
      "position" : 6
    },
    {
      "token" : "c",
      "start_offset" : 21,
      "end_offset" : 22,
      "type" : "word",
      "position" : 7
    }
  ]
}
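
To split on a non-default pattern such as \d+, define a custom analyzer of type pattern in the index settings. A minimal sketch (the index and analyzer names are illustrative):

PUT /my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_digit_split_analyzer": {
          "type": "pattern",
          "pattern": "\\d+",
          "lowercase": true
        }
      }
    }
  }
}

# splitting on runs of digits, "abc123DEF" should yield the tokens abc and def
GET /my_index/_analyze
{
  "analyzer": "my_digit_split_analyzer",
  "text": "abc123DEF"
}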

Chinese analysis

From https://github.com/medcl/elasticsearch-analysis-ik pick the release that matches your Elasticsearch version. My current Elasticsearch version is v7.1.0, so I pick the 7.1.0 plugin as well.

 ./bin/elasticsearch-plugin install https://github.com/medcl/elasticsearch-analysis-ik/releases/download/v7.1.0/elasticsearch-analysis-ik-7.1.0.zip

If you run an Elasticsearch cluster, install the IK plugin on every node; otherwise Kibana may misbehave.
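
After restarting the nodes, you can check that the plugin is loaded everywhere:

# lists the plugins installed on each node; the ik plugin should appear for every node
GET /_cat/plugins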

GET /_analyze
{
  "analyzer": "ik_smart",
  "text":"苹果电脑是比较适合程序员的电脑"
}
# result
{
  "tokens" : [
    {
      "token" : "苹果电脑",
      "start_offset" : 0,
      "end_offset" : 4,
      "type" : "CN_WORD",
      "position" : 0
    },
    {
      "token" : "是",
      "start_offset" : 4,
      "end_offset" : 5,
      "type" : "CN_CHAR",
      "position" : 1
    },
    {
      "token" : "比较",
      "start_offset" : 5,
      "end_offset" : 7,
      "type" : "CN_WORD",
      "position" : 2
    },
    {
      "token" : "适合",
      "start_offset" : 7,
      "end_offset" : 9,
      "type" : "CN_WORD",
      "position" : 3
    },
    {
      "token" : "程序员",
      "start_offset" : 9,
      "end_offset" : 12,
      "type" : "CN_WORD",
      "position" : 4
    },
    {
      "token" : "的",
      "start_offset" : 12,
      "end_offset" : 13,
      "type" : "CN_CHAR",
      "position" : 5
    },
    {
      "token" : "电脑",
      "start_offset" : 13,
      "end_offset" : 15,
      "type" : "CN_WORD",
      "position" : 6
    }
  ]
}
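
The IK plugin also ships an ik_max_word analyzer, which produces a finer-grained split than ik_smart. A common setup, sketched here with an assumed index my_index and field content, is to index with ik_max_word and search with ik_smart:

PUT /my_index
{
  "mappings": {
    "properties": {
      "content": {
        "type": "text",
        "analyzer": "ik_max_word",
        "search_analyzer": "ik_smart"
      }
    }
  }
}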
