Analyzer

analysis -- text analysis is the process of converting full text into a series of terms (tokens), also known as tokenization

analyzer -- an analyzer is the component dedicated to tokenization; it consists of three parts

Character Filters (operate on the raw text, e.g. stripping HTML tags)

Tokenizer -- splits the text into terms according to a set of rules

Token Filters -- post-process the terms: lowercasing, removing stopwords, adding synonyms
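The three-stage pipeline above (character filter → tokenizer → token filters) can be sketched in plain Python; this is only an illustration of the flow, not the actual Elasticsearch implementation:

```python
import re

STOPWORDS = {"the", "a", "is"}

def char_filter(text):
    # Character filter: strip HTML tags from the raw text
    return re.sub(r"<[^>]+>", "", text)

def tokenizer(text):
    # Tokenizer: split the text into terms on whitespace
    return text.split()

def token_filter(tokens):
    # Token filters: lowercase, then drop stopwords
    return [t.lower() for t in tokens if t.lower() not in STOPWORDS]

def analyze(text):
    # Run the three stages in order, like an analyzer does
    return token_filter(tokenizer(char_filter(text)))

print(analyze("<b>The</b> Quick fox"))  # ['quick', 'fox']
```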

Built-in analyzers in Elasticsearch

Standard Analyzer -- the default; splits on word boundaries, lowercases

Stop Analyzer -- lowercases and removes stopwords (the, a, is, ...)

Simple Analyzer -- splits on non-letters (symbols are discarded), lowercases

Whitespace Analyzer -- splits on whitespace, does not lowercase

Keyword Analyzer -- no tokenization; the input is emitted unchanged as a single term

Pattern Analyzer -- tokenizes with a regular expression; the default pattern \W+ splits on runs of non-word characters

Language Analyzers -- analyzers for 30+ common languages
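The Pattern Analyzer's default behavior (split on \W+, then lowercase) can be approximated in a few lines of Python; a sketch for illustration only:

```python
import re

def pattern_tokenize(text):
    # Split on runs of non-word characters (the Pattern Analyzer's
    # default \W+ separator), drop empty strings, and lowercase
    return [t.lower() for t in re.split(r"\W+", text) if t]

print(pattern_tokenize("brown-foxes, leap!"))  # ['brown', 'foxes', 'leap']
```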

Three common ways to test analysis with the _analyze API

1. Specify the analyzer directly

GET _analyze
{
  "analyzer": "standard",
  "text": "Mastering English"
}

Result:

{
  "tokens": [
    {
      "token": "mastering",
      "start_offset": 0,
      "end_offset": 9,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "english",
      "start_offset": 10,
      "end_offset": 17,
      "type": "<ALPHANUM>",
      "position": 1
    }
  ]
}

2. Test against a field of an index

POST /book/_analyze
{
  "field": "title",
  "text": "Mastering English"
}
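This call assumes an index named book with a text field title already exists. A minimal mapping that makes the example work might look like this (the analyzer choice is an assumption):

```
PUT /book
{
  "mappings": {
    "properties": {
      "title": { "type": "text", "analyzer": "standard" }
    }
  }
}
```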

3. Test a custom combination of tokenizer and token filters

POST /_analyze
{
  "tokenizer": "standard",
  "filter": ["lowercase"],
  "text": "Mastering English"
}
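The ad-hoc combination above can also be registered as a named custom analyzer in the index settings, so it is usable in mappings; a sketch (the index and analyzer names here are hypothetical):

```
PUT /my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase"]
        }
      }
    }
  }
}
```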

 

Demonstration of the different analyzers

GET _analyze
{
  "analyzer": "standard",
  "text": "2 running Quick brown-foxes leap over lazy dogs in the summer evening."
}

Result:

{
  "tokens": [
    {
      "token": "2",
      "start_offset": 0,
      "end_offset": 1,
      "type": "<NUM>",
      "position": 0
    },
    {
      "token": "running",
      "start_offset": 2,
      "end_offset": 9,
      "type": "<ALPHANUM>",
      "position": 1
    },
    {
      "token": "quick",
      "start_offset": 10,
      "end_offset": 15,
      "type": "<ALPHANUM>",
      "position": 2
    },
    {
      "token": "brown",
      "start_offset": 16,
      "end_offset": 21,
      "type": "<ALPHANUM>",
      "position": 3
    },
    {
      "token": "foxes",
      "start_offset": 22,
      "end_offset": 27,
      "type": "<ALPHANUM>",
      "position": 4
    },
    {
      "token": "leap",
      "start_offset": 28,
      "end_offset": 32,
      "type": "<ALPHANUM>",
      "position": 5
    },
    {
      "token": "over",
      "start_offset": 33,
      "end_offset": 37,
      "type": "<ALPHANUM>",
      "position": 6
    },
    {
      "token": "lazy",
      "start_offset": 38,
      "end_offset": 42,
      "type": "<ALPHANUM>",
      "position": 7
    },
    {
      "token": "dogs",
      "start_offset": 43,
      "end_offset": 47,
      "type": "<ALPHANUM>",
      "position": 8
    },
    {
      "token": "in",
      "start_offset": 48,
      "end_offset": 50,
      "type": "<ALPHANUM>",
      "position": 9
    },
    {
      "token": "the",
      "start_offset": 51,
      "end_offset": 54,
      "type": "<ALPHANUM>",
      "position": 10
    },
    {
      "token": "summer",
      "start_offset": 55,
      "end_offset": 61,
      "type": "<ALPHANUM>",
      "position": 11
    },
    {
      "token": "evening",
      "start_offset": 62,
      "end_offset": 69,
      "type": "<ALPHANUM>",
      "position": 12
    }
  ]
}

 

GET _analyze
{
  "analyzer": "simple",
  "text": "2 running Quick brown-foxes leap over lazy dogs in the summer evening."
}

{
  "tokens": [
    {
      "token": "running",
      "start_offset": 2,
      "end_offset": 9,
      "type": "word",
      "position": 0
    },
    {
      "token": "quick",
      "start_offset": 10,
      "end_offset": 15,
      "type": "word",
      "position": 1
    },
    {
      "token": "brown",
      "start_offset": 16,
      "end_offset": 21,
      "type": "word",
      "position": 2
    },
    {
      "token": "foxes",
      "start_offset": 22,
      "end_offset": 27,
      "type": "word",
      "position": 3
    },
    {
      "token": "leap",
      "start_offset": 28,
      "end_offset": 32,
      "type": "word",
      "position": 4
    },
    {
      "token": "over",
      "start_offset": 33,
      "end_offset": 37,
      "type": "word",
      "position": 5
    },
    {
      "token": "lazy",
      "start_offset": 38,
      "end_offset": 42,
      "type": "word",
      "position": 6
    },
    {
      "token": "dogs",
      "start_offset": 43,
      "end_offset": 47,
      "type": "word",
      "position": 7
    },
    {
      "token": "in",
      "start_offset": 48,
      "end_offset": 50,
      "type": "word",
      "position": 8
    },
    {
      "token": "the",
      "start_offset": 51,
      "end_offset": 54,
      "type": "word",
      "position": 9
    },
    {
      "token": "summer",
      "start_offset": 55,
      "end_offset": 61,
      "type": "word",
      "position": 10
    },
    {
      "token": "evening",
      "start_offset": 62,
      "end_offset": 69,
      "type": "word",
      "position": 11
    }
  ]
}

 

GET _analyze
{
  "analyzer": "stop",
  "text": "2 running Quick brown-foxes leap over lazy dogs in the summer evening."
}

{
  "tokens": [
    {
      "token": "running",
      "start_offset": 2,
      "end_offset": 9,
      "type": "word",
      "position": 0
    },
    {
      "token": "quick",
      "start_offset": 10,
      "end_offset": 15,
      "type": "word",
      "position": 1
    },
    {
      "token": "brown",
      "start_offset": 16,
      "end_offset": 21,
      "type": "word",
      "position": 2
    },
    {
      "token": "foxes",
      "start_offset": 22,
      "end_offset": 27,
      "type": "word",
      "position": 3
    },
    {
      "token": "leap",
      "start_offset": 28,
      "end_offset": 32,
      "type": "word",
      "position": 4
    },
    {
      "token": "over",
      "start_offset": 33,
      "end_offset": 37,
      "type": "word",
      "position": 5
    },
    {
      "token": "lazy",
      "start_offset": 38,
      "end_offset": 42,
      "type": "word",
      "position": 6
    },
    {
      "token": "dogs",
      "start_offset": 43,
      "end_offset": 47,
      "type": "word",
      "position": 7
    },
    {
      "token": "summer",
      "start_offset": 55,
      "end_offset": 61,
      "type": "word",
      "position": 10
    },
    {
      "token": "evening",
      "start_offset": 62,
      "end_offset": 69,
      "type": "word",
      "position": 11
    }
  ]
}

 

#whitespace
GET _analyze
{
  "analyzer": "whitespace",
  "text": "2 running Quick brown-foxes leap over lazy dogs in the summer evening."
}

 

{
  "tokens": [
    {
      "token": "2",
      "start_offset": 0,
      "end_offset": 1,
      "type": "word",
      "position": 0
    },
    {
      "token": "running",
      "start_offset": 2,
      "end_offset": 9,
      "type": "word",
      "position": 1
    },
    {
      "token": "Quick",
      "start_offset": 10,
      "end_offset": 15,
      "type": "word",
      "position": 2
    },
    {
      "token": "brown-foxes",
      "start_offset": 16,
      "end_offset": 27,
      "type": "word",
      "position": 3
    },
    {
      "token": "leap",
      "start_offset": 28,
      "end_offset": 32,
      "type": "word",
      "position": 4
    },
    {
      "token": "over",
      "start_offset": 33,
      "end_offset": 37,
      "type": "word",
      "position": 5
    },
    {
      "token": "lazy",
      "start_offset": 38,
      "end_offset": 42,
      "type": "word",
      "position": 6
    },
    {
      "token": "dogs",
      "start_offset": 43,
      "end_offset": 47,
      "type": "word",
      "position": 7
    },
    {
      "token": "in",
      "start_offset": 48,
      "end_offset": 50,
      "type": "word",
      "position": 8
    },
    {
      "token": "the",
      "start_offset": 51,
      "end_offset": 54,
      "type": "word",
      "position": 9
    },
    {
      "token": "summer",
      "start_offset": 55,
      "end_offset": 61,
      "type": "word",
      "position": 10
    },
    {
      "token": "evening.",
      "start_offset": 62,
      "end_offset": 70,
      "type": "word",
      "position": 11
    }
  ]
}

 

#keyword
GET _analyze
{
  "analyzer": "keyword",
  "text": "2 running Quick brown-foxes leap over lazy dogs in the summer evening."
}

 

{
  "tokens": [
    {
      "token": "2 running Quick brown-foxes leap over lazy dogs in the summer evening.",
      "start_offset": 0,
      "end_offset": 70,
      "type": "word",
      "position": 0
    }
  ]
}

 

GET _analyze
{
  "analyzer": "pattern",
  "text": "2 running Quick brown-foxes leap over lazy dogs in the summer evening."
}

 

{
  "tokens": [
    {
      "token": "2",
      "start_offset": 0,
      "end_offset": 1,
      "type": "word",
      "position": 0
    },
    {
      "token": "running",
      "start_offset": 2,
      "end_offset": 9,
      "type": "word",
      "position": 1
    },
    {
      "token": "quick",
      "start_offset": 10,
      "end_offset": 15,
      "type": "word",
      "position": 2
    },
    {
      "token": "brown",
      "start_offset": 16,
      "end_offset": 21,
      "type": "word",
      "position": 3
    },
    {
      "token": "foxes",
      "start_offset": 22,
      "end_offset": 27,
      "type": "word",
      "position": 4
    },
    {
      "token": "leap",
      "start_offset": 28,
      "end_offset": 32,
      "type": "word",
      "position": 5
    },
    {
      "token": "over",
      "start_offset": 33,
      "end_offset": 37,
      "type": "word",
      "position": 6
    },
    {
      "token": "lazy",
      "start_offset": 38,
      "end_offset": 42,
      "type": "word",
      "position": 7
    },
    {
      "token": "dogs",
      "start_offset": 43,
      "end_offset": 47,
      "type": "word",
      "position": 8
    },
    {
      "token": "in",
      "start_offset": 48,
      "end_offset": 50,
      "type": "word",
      "position": 9
    },
    {
      "token": "the",
      "start_offset": 51,
      "end_offset": 54,
      "type": "word",
      "position": 10
    },
    {
      "token": "summer",
      "start_offset": 55,
      "end_offset": 61,
      "type": "word",
      "position": 11
    },
    {
      "token": "evening",
      "start_offset": 62,
      "end_offset": 69,
      "type": "word",
      "position": 12
    }
  ]
}

 

#english
GET _analyze
{
  "analyzer": "english",
  "text": "2 running Quick brown-foxes leap over lazy dogs in the summer evening."
}

{
  "tokens": [
    {
      "token": "2",
      "start_offset": 0,
      "end_offset": 1,
      "type": "<NUM>",
      "position": 0
    },
    {
      "token": "run",
      "start_offset": 2,
      "end_offset": 9,
      "type": "<ALPHANUM>",
      "position": 1
    },
    {
      "token": "quick",
      "start_offset": 10,
      "end_offset": 15,
      "type": "<ALPHANUM>",
      "position": 2
    },
    {
      "token": "brown",
      "start_offset": 16,
      "end_offset": 21,
      "type": "<ALPHANUM>",
      "position": 3
    },
    {
      "token": "fox",
      "start_offset": 22,
      "end_offset": 27,
      "type": "<ALPHANUM>",
      "position": 4
    },
    {
      "token": "leap",
      "start_offset": 28,
      "end_offset": 32,
      "type": "<ALPHANUM>",
      "position": 5
    },
    {
      "token": "over",
      "start_offset": 33,
      "end_offset": 37,
      "type": "<ALPHANUM>",
      "position": 6
    },
    {
      "token": "lazi",
      "start_offset": 38,
      "end_offset": 42,
      "type": "<ALPHANUM>",
      "position": 7
    },
    {
      "token": "dog",
      "start_offset": 43,
      "end_offset": 47,
      "type": "<ALPHANUM>",
      "position": 8
    },
    {
      "token": "summer",
      "start_offset": 55,
      "end_offset": 61,
      "type": "<ALPHANUM>",
      "position": 11
    },
    {
      "token": "even",
      "start_offset": 62,
      "end_offset": 69,
      "type": "<ALPHANUM>",
      "position": 12
    }
  ]
}

 

POST _analyze
{
  "analyzer": "icu_analyzer",
  "text": "他说的确实在理"
}
This returns an error because the analysis-icu plugin is not installed.
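The ICU analyzer ships as the official analysis-icu plugin; it can be installed on each node and takes effect after a restart:

```
bin/elasticsearch-plugin install analysis-icu
```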

POST _analyze
{
  "analyzer": "standard",
  "text": "他说的确实在理"
}

{
  "tokens": [
    {
      "token": "他",
      "start_offset": 0,
      "end_offset": 1,
      "type": "<IDEOGRAPHIC>",
      "position": 0
    },
    {
      "token": "说",
      "start_offset": 1,
      "end_offset": 2,
      "type": "<IDEOGRAPHIC>",
      "position": 1
    },
    {
      "token": "的",
      "start_offset": 2,
      "end_offset": 3,
      "type": "<IDEOGRAPHIC>",
      "position": 2
    },
    {
      "token": "确",
      "start_offset": 3,
      "end_offset": 4,
      "type": "<IDEOGRAPHIC>",
      "position": 3
    },
    {
      "token": "实",
      "start_offset": 4,
      "end_offset": 5,
      "type": "<IDEOGRAPHIC>",
      "position": 4
    },
    {
      "token": "在",
      "start_offset": 5,
      "end_offset": 6,
      "type": "<IDEOGRAPHIC>",
      "position": 5
    },
    {
      "token": "理",
      "start_offset": 6,
      "end_offset": 7,
      "type": "<IDEOGRAPHIC>",
      "position": 6
    }
  ]
}

 

POST _analyze
{
  "analyzer": "icu_analyzer",
  "text": "这个苹果不大好吃"
}

The IK analyzer (a Chinese-analysis plugin)

POST _analyze
{
  "analyzer": "ik_smart",
  "text": "中华人民共和国国歌"
}

POST _analyze
{
  "analyzer": "ik_max_word",
  "text": "中华人民共和国国歌"
}

ik_max_word: splits the text at the finest granularity. For example, "中华人民共和国国歌" is split into 中华人民共和国, 中华人民, 中华, 华人, 人民共和国, 人民, 人, 民, 共和国, 共和, 和, 国国, 国歌 -- it exhausts every possible combination, which suits term queries;

ik_smart: splits at the coarsest granularity. The same text is split into just 中华人民共和国 and 国歌, which suits phrase queries.
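Given that asymmetry, a common pattern is to index with ik_max_word (maximize recall) and search with ik_smart (fewer, more precise query terms), using the mapping's analyzer and search_analyzer parameters; a sketch with hypothetical index and field names:

```
PUT /my_index
{
  "mappings": {
    "properties": {
      "content": {
        "type": "text",
        "analyzer": "ik_max_word",
        "search_analyzer": "ik_smart"
      }
    }
  }
}
```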
